Notes from The History of Cartography, Vol 3: Cartography in the European Renaissance

The History of Cartography, Volume 3: Cartography in the European Renaissance

David Woodward

History of Cartography project

Overview

  • This is an enormous series. This single volume is over 2,000 pages, so this is going to be a slow read. I’m interested in this period of time because this is when maps transitioned from a kind of pictographic story to instruments of navigation. The snippets that follow reflect that focus.

Notes

  • The thematic essays raise important issues in the history of cartography that both set an agenda for future research on Renaissance maps and take stock of the growing role of cartography as a way to organize social, political, and cultural space. These essays are meant to be thought provoking, rather than exhaustive, and reflect some of the multilayered approaches that the study of maps has adopted in the past two decades. They show how the authority of maps became an essential factor in influencing the ways in which Renaissance Europeans saw and imagined the geographic layout, order, and substance of the world – with “world” meaning not only an external object to be represented, but also a stage on which internal human aspirations could be played out.  (p. xxxix)
  • The investigation of how maps were conceived, made, and used in this period provides a case study highlighting some of these historiographical issues in a new way. Indeed it is surprising that Burckhardt completely ignored these cartographic aspects even when stressing the importance of the discovery of the world and its relationship to the discovery of the self, both topics on which the history of cartography has much to say. (p. 6)

Table (p. 7)

  • In Volume 1 of this History, the point was made that the word mappa or mappamundi in the Middle Ages could be used to describe either a text or a map. This practice continued into the sixteenth and seventeenth centuries, as with Sebastian Münster’s Mappa Evropae (Frankfurt, 1537), John Smith’s A Map of Virginia (Oxford, 1612), or Thomas Jenner’s A Map of the Whole World (London, 1668). Indeed the metaphorical use of the word “map” to describe not only geographical descriptions but also other activities has exploded even in our own day, as we hear almost daily of the “road map” to peace in the Middle East. (p. 7)
  • Likewise, the classical and medieval written land itineraries continued to be a robust tool for wayfinding, and these were by no means replaced by their graphic equivalents. Although we have a famous example of an assemblage of graphic and written itineraries in the Tabula Peutingeriana, an image whose pedigree goes back to the fourth century, written directions of how to get from one place to another predominated over maps in the medieval period. One may even question the extent to which graphic itineraries were actually used on the road. (p. 8)
  • Finally, textual sailing directions, known as periploi in classical times and portolans (portolani) in the Middle Ages, continued to be favored by many sailors over their graphic equivalents into the sixteenth and seventeenth centuries, particularly in northern European waters, where they became known as rutters. The confusion still persists today, as the term “portolan” is often used when “portolan chart” is intended, leading some to propose that the term be abolished altogether. As Fernández-Armesto argues in this volume, maps and charts were not used for navigation in the Renaissance as much as written sailing directions. (p. 8)
  • It was not derived from Ptolemy’s Geography, for Ptolemy stressed that local maps (chorographies) should not be based on measurement, but should instead be made by artists. (p. 10)
  • Between 1400 and 1472, in the manuscript era, it has been estimated that there were a few thousand maps in circulation; between 1472 and 1500, about 56,000; and between 1500 and 1600, millions. The significant increase in the sheer number of maps available for viewing calls for an explanation. Certainly maps began to serve a huge variety of political and economic functions in society. (p. 11)
  • The change in the abstract conception of space, from the center-enhancing mappaemundi to the Ptolemaic isotropic structure of mapmaking, has often been called the quintessential modernity of Renaissance cartography. The evidence for this lies in the relative scarcity of terrestrial maps bearing longitude and latitude before the fifteenth century. No terrestrial maps using longitude and latitude survive from thirteenth- and fourteenth-century Europe, despite Roger Bacon’s description of one on a sheepskin with cities shown by small red circles in the “Opus maius” (ca. 1265). In comparison, by the mid-seventeenth century, the observation of latitude and longitude as control points for topographical surveys had been introduced in France. What happened in the intervening four centuries is routinely ascribed to the rediscovery of Ptolemy’s manual of mapmaking in the first decade of the fifteenth century. (p. 12)
  • The notion of a bounded uniform space also implies that the objects placed in it are co-synchronous, a concept that, as we shall see, led to the idea that historical and “modern” maps could and should be separate documents. Since the surface is represented as a uniform space, scale and proportion are also possible. (p. 13)
  • Measurements of sufficient precision to take full advantage of the Ptolemaic paradigm were not available until astronomical measurements of latitude and longitude had become routine. (p. 13)
  • Geographic coordinates were thus mainly of scholarly and not practical concern until reliable astronomical measurements of both longitude and latitude became available in the late eighteenth century, after a satisfactory chronometer had been developed. Coordinates and projection grids certainly were powerful rhetorical devices in the fifteenth and sixteenth centuries, but the data behind them was often questionable. (p. 13)
  • The adoption of systematic map projections introduced a variety of centering and framing issues. The center of a projection did not usually imply either the author’s viewpoint or the most important feature to be portrayed. Unlike mappaemundi, in which Jerusalem, Delos, Rome, or some other holy place might be at the center of the map, a map such as Rosselli’s ovoid “World map” was centered on no particular place (the center is off the coast of modern Somaliland). What could be manipulated was the field of view of the projection. Since graduation in longitude and latitude forced the hand of the cartographer to some extent, the area to be covered by a projection had to be carefully calculated. (p. 14)
  • The tense of medieval mappaemundi usually covered a broad span of historical time. No strong distinction between a location and an event was drawn. Places that had once been important in history but no longer existed were shown side by side with currently important places. The map told a story, often a very long one. In the fifteenth and sixteenth centuries, as the atlas became a major genre, this storytelling role was still enormously important in maps. (p. 16)
  • The use of intersections of longitude and latitude that Ptolemy proposed as control points for mapmaking is not unlike the process by which a researcher gathers observations about the world and compares them against the framework of the laws of nature. It is not surprising that the map has been used as a metaphor for modern science. If “science” in the Renaissance meant the pursuit of knowledge about the natural world, the model of cartography built upon the cumulative observations of others. (p. 17)
  • An illustration of this approach to compilation using widely different sources is provided by Nicolaus Cusanus’s intriguing image of the cosmographer as creator, which we find in the Compendium, written in the year of his death, 1464. Nicolaus chose the metaphor of a cosmographer as a man positioned in a city with five gates, representing the five senses. Messengers bring him information about the world using these senses, and he records the information in order to have a complete record of the external world. He tries to keep all the gates open so as not to miss information gathered by any particular sense. When he has received all the information from the messengers, he “compiles it into a well-ordered and proportionally measured map lest it be lost.” He then shuts the gates, sends away the messengers, and turns to the map, meditating on God as the Creator who existed prior to the entire world, just as the cosmographer existed prior to the appearance of the map. Nicolaus concludes that, “in so far as he is a cosmographer, he is creator of the world,” a carefully worded phrase whose sentiment would get cosmographers such as Gerardus Mercator and André Thevet into trouble with the church a century later. Nicolaus’s story illustrates the notion that by creating maps people saw, perhaps for the first time, that they could influence events and create worlds, that they could have the freedom to do things, rather than accept passively whatever God had ordained. Implicit in this passage is the realization that the world and the human representation of it were two different things. (p. 17)
  • Renaissance cartography has often been linked to the colonial and religious expansion of Europe. Mapping supported a sense of territorial self-entitlement that allowed religious and political leaders to claim vast areas of land overseas in the name of Christian European states. In Brian Harley’s words, “Maps were also inscriptions of political power. Far from being the innocent products of disinterested science, they acted in constructing the world they intended to represent …. Cartographic power was also a metaphor. It was expressed as imperial or religious rhetoric, as part of the creation ritual of taking possession of the land.” Such ceremonies of possession varied with the colonial power. The Portuguese relied on the abstract means of description, measured latitudes, to claim land. Their argument was that they had developed the technological knowledge to do so and hence had the right to wield it to their advantage. Mapping and surveying knowledge seem such an obvious form of evidence for colonial claims that their lack of treatment in some works is puzzling. (p. 19)
  • Ptolemy’s positive influence was far subtler, implying through a mathematization of the known inhabited world by means of longitude and latitude a measured, albeit faulty, estimate of what remained beyond the Greco-Roman inhabited world. Marco Polo’s book, on the other hand – even granted its author’s penchant for exaggeration – provided a narrative description of renewed trading possibilities with the East. Marco’s travels, in turn, were prompted by the Crusades (1096-1270), which enormously widened the geographical horizons of many classes of people, increased mobility, and fostered a culture of trade and travel. (p. 20)
  • Harris has made the point that cartography was a paradigmatic “big science” in the sense that it employed long-distance networks. He uses the concept of the “geography of knowledge,” by which he means the spatial connections between artifacts and people associated with a particular branch of knowledge, to explain how large corporations operated. He gives four examples, all of which have strong cartographic associations: the Casa de la Contratación de las Indias, the Consejo Real y Supremo de las Indias, the Verenigde Oostindische Compagnie (VOC), and the Society of Jesus. (p. 20)
  • Eisenstein’s thoughtful commentary on Ivins’s dictum on the exactly repeatable pictorial statement was particularly welcome to historians of cartography as it used the example of printed maps to enlarge the context. She introduced the topic by stating that “the fact that identical images, maps and diagrams could be viewed simultaneously by scattered readers constituted a kind of communications revolution in itself.” Eisenstein’s view of the importance of printing for the cumulative gathering of information is echoed by Olson, whose general book on the implications of writing and reading unusually contains a section on maps. According to Olson, “The 600 or so maps which have survived from the period before 1300 show no sign of general developmental progression towards a comprehensive map of the world. The principal stumbling block to such a map was the lack of reliable means of duplicating maps, an obstacle overcome only with the invention of printing and engraving, and the invention of a common, mathematical, frame of reference which would permit the integration and synthesis of information being accumulated on the voyages of discovery.” (p. 21)
  • The concept of publishing did not depend on printing; Pliny the Younger refers to an “edition” of a thousand copies of a manuscript text. But when viewed as conveyors of information, Ivins and Eisenstein argue that the advantage of printed images lay more in the production of versions free from the corruption of the copyist, which could be used for comparative study. When map compilers had at their fingertips several standard printed sources of geographical data, such study was bound to benefit. As maps from different regions, scales, and epochs were brought into contact with each other in the course of compiling successive editions of atlases, contradictions became more visible, and divergent traditions more difficult to reconcile. (p. 21)
  • However, if one focuses not on the content of maps but on their economic role as consumer commodities, a different picture emerges. Here their graphic form as well as their function was important in establishing a holistic vision of the world. Such a vision of the general layout of countries and continents might not have been particularly accurate (a limitation that persists today not only in the general population but also in political leaders), but it engendered a culture of cosmopolitanism in a larger range of social classes. Geography also became an essential part of general education, and the accoutrements of the cartographer (surveying instruments, globe, and armillary spheres) became icons of learning. (p. 22)
  • One could maintain that the use of maps to plot observations lagged as much as the sacred uses of maps persisted. (p. 23)
  • This is not to say that profound changes in cartographic method and practice did not take place in the Renaissance. The fact that the abstract theory of geographical coordinates was accepted as a way to make maps was in itself a significant change, as was the construction of maps orthogonally, from an infinity of impossible human viewpoints in space. The implications of this geometric view of cartography for the centering, framing, and orientation of maps were far reaching in the public perception throughout the world. (p. 23)
  • Coincident with this new way of plotting data arose an awareness of the representation itself and of how it related to the world, or an awareness that representations of the world and the world itself were two different things. This resulted in a greater reliance on or more thought given to using artificial codes in cartographic representation. (p. 23)
  • The Middle Ages has been described as a period that “knew little of maps,” and indeed the number of surviving examples, even if allowances are made for what was probably an extremely high rate of loss, do not suggest that maps were produced and consumed in particularly large numbers between the fifth and fourteenth centuries. (p. 25)
  • Although medieval maps often used to be described as copying a few standard models and repeating a tired assortment of information drawn from classical and biblical sources, it is becoming increasingly clear that they, like all other maps, should instead be understood as tools for thinking and as flexible means of communicating ideas. In the Middle Ages, as in other periods, maps could be shaped and manipulated to meet particular needs as their authors drew from graphic and textual traditions, from experience, and from their own ideas to create individual artifacts suited to given contexts. As Gautier Dalché has emphasized, maps, like other representations, do not inform us generally about contemporaries’ perceptions of space, but rather about the mental and technical tools available to the mapmaker. Medieval maps must, in short, be approached not as transparent windows into their creators’ and users’ minds but as rhetorically constructed documents belonging to specific times and specific contexts. Recent studies have emphasized the importance of exploring these contexts, whether the specific codicological context of a particular manuscript or the larger social and cultural setting in which the map was conceived, as essential to understanding the full meaning of a given map within its society. (p. 26)
  • One of the most influential contributions to the study of medieval cartography has been the idea that world maps were intended to describe time as well as space. Since the publication of two highly influential articles by von den Brincken on the close relationship between universal chronicles, those that attempted to sum up all of human history in one work, and world maps, it has been widely accepted that one function of these maps was to give an overview of the world, understood as the theater of human, and especially Christian, history. (p. 30)
  • Within their broad function as representations of space and time, world maps could serve a wide variety of more specific rhetorical needs. One way to explore the functions of the world map in medieval society is through the multivalent meanings of the world itself in the learned culture of the time. Part of the curiosity about the physical world that characterized the twelfth-century Renaissance was the desire to understand the earth as a part of a system. The concern among philosophers for the machina universitatis or the machina mundi led them to focus on the system underlying the universe and the laws that governed it. The details of the earth itself (terra, both the planet and the element earth) were of less interest to them than the grand mechanism of the world (mundus). Contrasted with this interest in the machina mundi was the equally vibrant idea of contemptus mundi (renunciation of the world), which drew on a related but different definition of the “world” to contrast the ascetic life with the life of ordinary secular affairs. “Secular” recalls the term saeculum that contrasted “the world of men and of time” with the eternal world of the Christian God. Between these extremes were the views of historians, pilgrims (whether armchair or actual), and other travelers, for which locations and events on the earth did matter and needed to be recalled. (p. 31)
  • Roger Bacon’s discussion of a figura or drawing showing major cities located according to their longitude and latitude. Bacon has in the past been credited with considerable innovations in geographical thought, most particularly in his understanding of the use of coordinates to create an accurate graphic representation of the world’s places. (p. 33)
  • Bacon was thus not unique in his interest in locating the places of the world accurately within a system that connected them to the heavens. (p. 34)
  • This was due in part to the heightened attention given in the twelfth century to the literal sense of biblical exegesis: understanding the names, places, and history described in the Bible was seen as the necessary foundation for examining other meanings (moral, Christological, or eschatological). (p. 34)
  • In conclusion, the surviving examples of world maps, along with other texts, images, and references to maps, bear witness to the passionate interest in the real world described by Gautier Dalché. The variety of functions that these maps could play reflects the multifarious meanings of the world in medieval culture, as the maps served to describe, analyze, summarize, and create knowledge and perceptions about the fundamental spaces of human existence. These were works destined for both elite and somewhat more popular audiences (including pilgrims, parishioners, and consumers of romances) to whom they helped provide visual, intellectual, and imaginative access to the larger world. As we have seen, the sensitivity of recent scholarship to the specific contexts in which maps appeared and the ways in which they were used has given us new insights into the complexity and subtlety of the potential meanings of medieval world maps, although much remains to be uncovered about the perception and representation of space in this fertile period. (p. 36)
  • It is, however, extremely important to remember the commitment of time and resources involved in producing the medieval copy: the question then arises of what this map meant to the society that found the human and financial resources to copy it. It has been plausibly explained in the context of the strong interest in the classical world that we have already seen influencing the toponyms of later medieval world maps; however, more research should be done to elucidate the importance and the influence of this map. (p. 38)
  • In addition to these surviving examples of itinerary maps, the significance of the itinerary – especially the written or narrated itinerary – is demonstrated by the frequency with which itineraries served as at least one source for other types of maps. For example, some of the information on the Hereford world map was based on an itinerary that may show a route familiar to English traders in France. Even more substantial is the role played by itineraries in the creation of regional maps. These interconnections are especially striking in the rich cartographic production of Matthew Paris, although, as we will see when we turn to earlier maps of Britain, his work is far from unique in this respect. (p. 39)
  • It is possible to describe the sources and creation of Paris’s maps of England in considerable detail, an approach very welcome in the study of medieval cartography, thanks to a study by Harvey. According to his reconstruction, Paris began by adopting the outline of the island from a world map, probably of Roman origin. He then drew on an itinerary from Dover to the Scottish border to develop his representation of the interior, filling in extra place-names around this core. His subsequent revisions of the map reflect his discoveries of new sources, providing the river network, for example, and improvements in the coastline. Collectively, these maps demonstrate how powerful a process the compilation of geographical information from various sources could be and how central a role itineraries and world maps could play in the elaboration of regional maps. (pp. 39-40)
  • most maps were made to aid in understanding, not primarily to represent space in a geometrically correct way. As Delano-Smith and Gruber point out, “a diagram is the most appropriate style for any map used in explanation,” a dictum with which medieval cartographers would have agreed wholeheartedly. (p. 44)
  • Both the examples of the smooth incorporation of the portolan chart with the world map and its more tentative acceptance in the Aslake world map indicate that, in spite of regional limitations of access to these map forms, mapmakers were eager to adapt new cartographic information to their own purposes when it came their way. The readiness of even a hesitant northern mapmaker to adopt a radically new depiction of space suggests that, in the fourteenth century, the idea was becoming fairly widely accepted that world maps could and should contain at least some detailed topographic information in addition to the historical and toponymic information presented by earlier world maps. (p. 46)
  • If we turn to our second author and artist, Opicino de Canistris, we find a similar range of maps in the service of a very different project. Opicino was not writing history, with its well-known attention to the loci (places) in which historical events took place. Instead, he worked from the equally familiar idea of the created world as God’s book to develop an elaborate system for understanding and recognizing sin in the individual via an analysis of the places of his life as represented on maps. (p. 47)
  • I have argued at length elsewhere that [Opicino de Canistris] believed that maps were important because the very schematization of the image of the world that they proposed bridged the gap between the materialistic human imagination and man’s higher powers of reason. As such, maps, for Opicino, were a potential answer to the spiritual problems of his time and fitting tools for a priest concerned with analyzing and combating unbelief. (p. 48)
  • Far from a unified project, the mapmaking and map use of the late medieval and early Renaissance period reveals itself as abundant and chaotic growth as yet unpruned into the chaste mathematical topiary of seventeenth-century cartography. (p. 52)

A simple example of ensemble training

I’ve been using multilayer perceptrons (MLPs) for some quickly trainable sequence-to-sequence time series predictions. The goal is to take sensor data from one day and use that as training data to predict the next day’s patterns. The application is extremely consistent, but the hardware slowly degrades. By retraining, the error detection system is able to “drift” with the system as various parts wear at different rates. And there are a lot of sensors, several thousand per system, so rapid training is a nice feature.

The problem that I was running into had to do with hyperparameter tuning. I would make a change or two, and then re-run the system on my well-characterized simulated data, and the accuracy of the result would change in odd ways. It was very frustrating.

As a way to work through more options in an automated way, I built an optimizer class using evolutionary algorithms (adjusting variables, rather than evolutionary programming, which evolves code). I could then fire up the evolver and try hundreds or thousands of models as the system worked to find the best fitness (in this case highest accuracy).

But there was a big problem, which I kind of knew about. The random initialization of weights makes a HUGE difference in the performance of the model. I discovered this while looking at the results of the evolver, which writes the best of each generation out to a spreadsheet:

If you look at row 8, you see a lovely fitness of 0.9, or 90%, which was the best value from the evolver runs. However, after sorting on the parameters so that they were grouped, it became obvious that there is a HUGE variance in the results. The lowest fitness is 30%, and the average fitness for those values is actually 60%. I tried running the parameters on multiple trained models and got similar results. These values are all over the place. Not good.

To address this, I need to be able to run a population and get the distribution stats (mean, 5% and 95% confidence, min, and max outliers). I can then sort on the mean, but also have insight into the variance. A good mean with wide variance may be worse than a slightly worse mean with tight variance.

So I added statistical tests to the evolver, based on this post, starting with the scikit-learn resample(). Here are the important bits:

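import pandas as pd
import scipy.stats as st
from sklearn.utils import resample

def calc_fitness_stats(self, resample_size: int = 100):
    # Bootstrap: draw resample_size values from the population, with replacement
    boot = resample(self.population, replace=True, n_samples=resample_size, random_state=1)
    s = pd.Series(boot)
    # Lower and upper bounds of the 95% confidence interval around the mean (Student's t)
    conf = st.t.interval(0.95, len(boot)-1, loc=s.mean(), scale=st.sem(boot))
    self.meta_info = {'mean':s.mean(), '5_conf':conf[0], '95_conf':conf[1], 'max':s.max(), 'min':s.min()}
    self.fitness = s.mean()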

To evaluate, I used my test landscape, a 3D surface based on the equation z = cos(x) + sin(y) + (x + y)/10, with x and y each over the range (-5, 5). I also added some randomness to the x and y values to noise up the results so the statistics would show something. This worked well on my landscape as you can see below, so I integrated it into my hyperparameter tuner.
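For reference, here is a minimal sketch of what such a noisy landscape could look like. The surface equation is the one above; the function names, the uniform jitter, and its width are my assumptions for illustration, not the code the evolver actually uses.

```python
import math
import random

def landscape(x: float, y: float) -> float:
    """Test surface from the text: z = cos(x) + sin(y) + (x + y) / 10."""
    return math.cos(x) + math.sin(y) + (x + y) / 10.0

def noisy_landscape(x: float, y: float, noise: float = 0.5, rng=None) -> float:
    """Evaluate the surface at a randomly jittered (x, y).

    The uniform jitter of width +/- noise is an assumption made for this sketch.
    """
    if rng is None:
        rng = random.Random()
    dx = rng.uniform(-noise, noise)
    dy = rng.uniform(-noise, noise)
    return landscape(x + dx, y + dy)

# Sample the same point ten times: the spread is what the statistics summarize
rng = random.Random(1)
samples = [noisy_landscape(1.0, 2.0, noise=0.5, rng=rng) for _ in range(10)]
print("min {:.3f}, max {:.3f}".format(min(samples), max(samples)))
```

Evaluating one point repeatedly gives a small population of fitness values, which is the kind of input a bootstrap routine like calc_fitness_stats() expects.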

Before I go into the results, let me describe the whole data set: what it looks like in total, which parts we are trying to recognize, and the ground truth that we are training against:

Full Data Set: The data is a set of mathematical functions. In this case, it’s a simple set of ten sin(x) waves of varying frequency. They all start at the same value, and evolve from there. The shortest wavelength is cyan, the longest is dark blue in the figure below. It’s a reasonable proxy for ten sensors that change over the course of a day, some quickly, some slowly:

Full_data

Training Set: I take the above dataset, which has 200 elements and split it in two. This creates a training set or input vector of 100 elements and an output, “ground truth” vector that the system will be trained to recognize. So ten shapes will be trained to map to ten other shapes in one MLP network:

Clean_input

Ground Truth: These are the 100-element vectors that we will be training the network to produce:
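A data set with this shape can be sketched in a few lines of NumPy. The frequency range, the x range, and the names (make_sine_set, full_mat, truth_mat) are assumptions for illustration; only the ten waves, the 200 samples, and the 100/100 split come from the description above.

```python
import numpy as np

def make_sine_set(num_waves: int = 10, num_samples: int = 200,
                  max_x: float = 4 * np.pi) -> np.ndarray:
    """Build num_waves sine curves of increasing frequency, one per row."""
    x = np.linspace(0.0, max_x, num_samples)
    freqs = np.linspace(1.0, 2.0, num_waves)  # assumed frequency range
    # Every row starts at sin(0) = 0, then the rows drift apart
    return np.sin(np.outer(freqs, x))  # shape: (num_waves, num_samples)

full_mat = make_sine_set()
train_mat = full_mat[:, :100]  # first half of each wave: the input vectors
truth_mat = full_mat[:, 100:]  # second half: the ground truth to predict
print(full_mat.shape, train_mat.shape, truth_mat.shape)
```

Each row plays the role of one sensor, so a single MLP learns to map ten input shapes to ten output shapes.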

All Predictions: If you take the first random result of the evolver, you will get ten models that are identical except for the initial weights. In this case, the hyperparameters are number of layers, neurons per layer, batch size and epochs. The evolver initially comes up with a population of ten random genomes (in specified ranges, like 10 – 1000 neurons, with a step of 10). It then keeps the five best “genomes” and breeds and mutates 5 more. New genomes are in turn run 10 times to produce the statistics. The models associated with the best values are saved.
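The keep-five, breed-five loop can be sketched as follows. This is a simplified stand-in, not my evolver class: toy_fitness() is a made-up noisy score that replaces training a real model, every range except the neuron range is hypothetical, and for brevity the sketch re-scores all ten genomes each generation rather than only the new ones.

```python
import random
import statistics

# Search ranges as (low, high, step); only the neuron range comes from the text
RANGES = {
    "layers": (1, 10, 1),
    "neurons": (10, 1000, 10),
    "batch_size": (1, 20, 1),
    "epochs": (10, 100, 10),
}

def random_gene(key, rng):
    low, high, step = RANGES[key]
    return rng.randrange(low, high + 1, step)

def random_genome(rng):
    return {key: random_gene(key, rng) for key in RANGES}

def breed(a, b, rng, mutation_rate=0.2):
    """Take each gene from one parent, then occasionally re-roll it."""
    child = {key: rng.choice((a[key], b[key])) for key in RANGES}
    for key in RANGES:
        if rng.random() < mutation_rate:
            child[key] = random_gene(key, rng)
    return child

def toy_fitness(genome, rng):
    """Made-up stand-in for training: a noisy score that peaks at 500 neurons."""
    score = 1.0 - abs(genome["neurons"] - 500) / 1000.0
    return score + rng.gauss(0.0, 0.05)

def mean_fitness(genome, rng, runs=10):
    """Score a genome by its mean over several noisy runs."""
    return statistics.mean(toy_fitness(genome, rng) for _ in range(runs))

def evolve(generations=20, pop_size=10, keep=5, seed=1):
    rng = random.Random(seed)
    population = [random_genome(rng) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda g: mean_fitness(g, rng), reverse=True)
        survivors = ranked[:keep]
        children = [breed(*rng.sample(survivors, 2), rng)
                    for _ in range(pop_size - keep)]
        population = survivors + children
    return max(population, key=lambda g: mean_fitness(g, rng))

best = evolve()
print(best)
```

Scoring on the mean of several runs, rather than on a single run, is what keeps one lucky weight initialization from winning a generation.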

If we look at one of the initial models, before any evolutionary optimization, you can see why this approach is needed. Remember, this variation is based solely on the different random initialization of the weights between layers. What you are looking at is the input vector being run through ten models that are used to calculate the statistical values of the ensemble. You can see that most values are pretty good, some are a bit off, and some are pretty bonkers.

Ensemble Average: On the whole though, if you take the average of all the ensemble, you get a pretty nice result. And, unlike the single-shot method of training, the likelihood that another ensemble produced with the same architecture will behave the same way is much higher.

Here’s the code to take the average:

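        # Sum the predictions of every saved model, then divide by the model count
        avg_mat = np.zeros(self.test_mat.shape)
        with os.scandir() as entries:
            count = 0  # start at zero so the divisor equals the number of models
            for entry in entries:
                if entry.is_file() or entry.is_symlink():
                    # stray files in the working directory are deleted
                    os.remove(entry.path)
                elif entry.is_dir():
                    # each subdirectory holds one saved Keras model
                    count += 1
                    print("loading: {}".format(entry.name))
                    new_model = tf.keras.models.load_model(entry.name)
                    self.predict_mat = new_model.predict(self.train_mat)
                    avg_mat = np.add(self.predict_mat, avg_mat)
        if count > 0:
            avg_mat = avg_mat / count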

 

This is not to say that the model is perfect. The orange curve at the top of the last chart is too low. This model had a mean accuracy of 67%. But this is roughly equivalent to my initial hyperparameter guesses. Let’s see what happens after 50 generations.

Five hours and 5,000 evaluations later, I have the full run of 50 generations. Things did get better. We end with a higher mean, but we also have a variance that does not steadily improve. This means that it’s possible that the architecture around generation 23 might actually be better:

Because all the values are saved in the spreadsheet, I can try those hyperparameters, but the system as I’ve written it only saves the “best” set of parameters. Let’s see what that best ensemble looks like as an ensemble when compared to the early run:

That is a lot better. All the related predictions are much closer to each other, and appear to be clustered around the right places. I am genuinely surprised how tidy the clustering is, based on the previous “All Predictions” plot towards the top of this post. On to the ensemble average:

That is extremely close to the “Ground Truth” chart. The orange line is in the right place, for example. The only error that I can see with a cursory visual inspection is that the height of the olive line is a little lower than it should be.

Now, I am concerned that there may be two peaks in this fitness landscape that we’re trying to climb. The one that we are looking for is a generalized model that can fit approximate curves. The other case is that the network has simply memorized the curves and will blow up when it sees something different. Let’s test that.

First, let’s revisit the training set. This model was trained with extremely clean data. The input is a sine function with varying frequencies, and the evaluation data is the same sine function, picking up where we cut off the training data. Here’s the clean data that was used to train the model:

Now let’s try noising that up, so that the model has to figure out what to do based on data that the model has never seen before:
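
A rough sketch of what "noising up" sine data can look like is below. The helper names, the frequencies, and the noise scale of 0.1 are all assumptions for illustration; they are not the values used to make the charts in this post.

```python
import numpy as np

def make_sine_rows(freqs, n_samples=100, t_max=2.0 * np.pi):
    """One row per frequency: sin(freq * t) sampled on a shared time axis."""
    t = np.linspace(0.0, t_max, n_samples)
    return np.sin(np.outer(freqs, t))

def add_noise(mat, scale, seed=None):
    """Return a copy of mat with zero-mean Gaussian noise added to every element."""
    rng = np.random.default_rng(seed)
    return mat + rng.normal(0.0, scale, size=mat.shape)

clean = make_sine_rows(freqs=[1.0, 2.0, 3.0])
noisy = add_noise(clean, scale=0.1, seed=1)
print("largest perturbation:", np.abs(noisy - clean).max())
```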

Let’s see what happened! First, let’s look at all the predictions from the ensemble:

The first thing that I notice is that it didn’t blow up. Although the paths from each model are somewhat different, each one got all the paths approximately right, and there is no wild deviation. The worst behavior (as usual?) is the orange band, and possibly the green band. But this looks like it should average well. Let’s take a look:

That seems pretty good. And the orange / green lines are in the right place. It’s the blue, olive, and grey lines that are a little low. Still, pretty happy with this.

So, ensembles seem to work very well, and make for resilient, predictable behavior in NN architectures. The cost is that there is much more time required to run many, many models through the system to determine which ensemble is right.

But if you want reproducible results, it’s a good way to go.

The Clockwork Muse

The Clockwork Muse: The Predictability of Artistic Change (1990)

Author (publications)

Colin Martindale (March 21, 1943 – November 16, 2008) was a professor of psychology at the University of Maine for 35 years.

Martindale wrote and did research analyzing artistic processes. His most popular work was The Clockwork Muse (1990). Martindale argued that all artistic development over time in written, visual and musical works was the result of a search for novelty.

Martindale was awarded the 1984 American Association for the Advancement of Science Prize for Behavioral Science Research.

Overview

Although the idea of Digital Humanities, or the quantitative analysis of the arts, only started to make its presence felt in the last half of the 2010s, The Clockwork Muse predates all this by decades. Martindale was doing computational analysis of poetry, prose, visual arts, architecture, and even scientific writing in the 1980s!

The central premise of this book is that human creative output is driven by what he calls The Law of Novelty, which consists of two components:

  1. Arousal potential
  2. Habituation, or exposure fatigue

His main premise is that to be successful, arousal potential must increase over time, and at a rate that the audience desires. Too slow, and the audience will become fatigued and find novelty elsewhere. Too fast, and the audience won’t comprehend. Artists that produce novelty at the right rate for their communities are the most successful. At the same time, artists behave according to the principle of least effort, and will take the easiest path to increasing arousal potential.

How art is created for human audiences is by varying the ratio of two approaches:

  1. Primordial content: This represents a “going back to basics” approach to art. Forms become simpler and less refined. Increases in primordial content typically occur at the beginning of a movement. Consider impressionism, where art went from highly refined representation to vaguer representations that incorporated a larger interpretive and emotional component.
  2. Stylistic change: This is the process that we see within an artistic movement, where the process is incrementally refined by the artists in the movement. A syntax is developed, and a progressively more sophisticated “conversation” emerges.

The contributions of these two approaches in art movements, careers, and even within sequential art such as novels and music vary inversely to one another in roughly sinusoidal patterns. While each value may decrease individually, the combination of the two increases.

Martindale then goes on to show that this theory holds in a multitude of contexts and studies. There are two general forms of the studies. In his text analytics, he analyzes variability/incongruity, groups terms into “Primordial”/“Stylistic” buckets, and then does statistical analysis on the results, generally finding good correlations. For the study of visual arts ranging from painting to architecture, he collects canonical images across a given timeframe (often in centuries), divides this span into segments, randomises the segments, and has subjects classify the images across multiple dimensions, which in turn can be collapsed into the primordial/stylistic overarching categories.

[Figure: Clockwork Muse, page 70]

My more theoretical thoughts

The dichotomy of primordial content / stylistic change matches Kauffman’s concepts of long jumps / hillclimbing on rough fitness landscapes, and also Iain McGilchrist’s interpretation of the roles of the left / right hemispheres of the brain in The Master and His Emissary, where the right hemisphere operates in global, general contexts and the left is local and specific.

Stylistic change depresses the hill it is climbing. That’s why large jumps in the fitness space represented by increased primordial content work: once a movement is exhausted, a small jump will only sink the terrain further. But a large jump is almost guaranteed to increase arousal potential.

People (and other organisms?) get bored with incremental progress below a certain rate (Habituation), and become biased towards long jumps that create change and provide novelty. There is survival value in this that is addressed in the explore/exploit paradox. This pressure for change at a certain rate is what drives much of our clustering behavior, which is based on alignment. Insufficient arousal builds up greater potential for nonlinear long jumps, or revolution.

Speed affects the ability to change direction. Changing direction stylistically can create more arousal potential than the linear extension of a particular technique (though reversing direction should lower arousal potential). That means the faster that stylistic change is happening, the more likely that progress will be linear (realism -> superrealism -> hyperrealism, etc.).

Notes

Chapter 1: A Scientific Approach to Art and Literature

  • As for history, the whole point of universal history is to find general laws that apply no matter what nations are involved or at what time. Because the topic is complex, these laws seem less deterministic than those governing objects falling in vacuums. History is, however, not completely chaotic and unpredictable. Just as in chemistry, certain combinations in certain situations will explode. Just as in physics, certain configurations in certain places will collapse. (Page 15)
  • Styles going by different names show obvious similarities: for example, eighteenth-century neoclassicism and fifteenth-century renaissance style, and fourteenth-century gothicism and nineteenth-century romanticism. Though the earlier style did not directly cause the later style, we must explain the similarities. Perhaps art oscillates along a continuum so that such similarities are inevitable. If so, we must explain why it oscillates, and why along that continuum rather than along some other one. Of course, we must explain what the continuum is and why it exists. (Page 20)
  • The psychologist D. T. Campbell (1965) argues for direct application of the principles of Darwinian evolution to change in cultural systems and products. Sociocultural change, he says, is a product of “blind” variation and selective retention. The three necessities for evolution of any sort are: presence of variations, consistent selection criteria that favor some variants over others, and mechanisms for preserving the selected variants. At any time, a number of variants of a given object are produced, and the most useful, pleasing or rewarding are chosen for retention. Then, at the next point in time, there is variation of the new form, and the process continues. Though such theories provide a general framework for thinking about aesthetic evolution, they do not tell us why aesthetic variation exists in the first place, nor were they proposed to do so. (Page 33)

Chapter 2: A Psychological Theory of Aesthetic Evolution

  • Theories concerning an inner logic driving change in the arts were anticipated by Herbert Spencer’s quasi-Darwinian theory. This English philosopher, in his major statement on art (1910 [1892]), set forth the principle that art, like everything else, moves from simple to complex. By complex, he meant more differentiated and hierarchically integrated. The anthropologist Alfred Kroeber (1956) followed Spencer in proposing such a simple-to-complex law, as did Kubler (1962). (Page 31)
  • If you drop an object in a vacuum, gravitation will cause it to move in a specific direction at a specific rate. Just so, if art were created in a social vacuum, pressure for novelty would cause it to evolve in a specific manner. The empirical evidence suggests that art tends in fact to evolve in a social vacuum, and that non-evolutionary factors are comparatively negligible. Though art is not produced in a complete social vacuum, I believe there to be more of a social vacuum than is commonly thought. Furthermore, social forces are analogous to friction, in that they impede or slow down the progress of an artistic tradition. (Page 34)
  • In the case of biological evolution, selected variations are, of course, encoded in the “memory” of DNA configurations. We may assume that more important or more complex DNA configurations are “forgotten” less quickly than less complex ones. The analogue is most certainly the case in the arts. If for no other reason than our educational system, the average reader of a contemporary British poet has some rudimentary “memory” of the poet’s predecessors at least back to the time of Chaucer. In comparison, the average person who purchases clothing has a poor memory for prior styles of fashion. As likely as not, such a person knows little about the styles of even thirty or forty years ago. The better “memory” of the poetic system leads us to expect that it should change in a different manner than fashion. In the latter, novelty seems often to be obtained by reviving a forgotten style, an option not as available to the poet. (Page 37)
  • …no one liked the Moonlight Sonata, he did not understand why the person was bothering with such chitchat. It did not concern Beethoven what people thought of that sonata. He did not write it for others. He wrote it for himself, and he liked it. As I have said, the notion that an artist is trying to communicate with an audience is misleading. It leads to an “audience-centric” confusion. (Page 38)
    • This book is about how Nomads work. High art has to be exclusively Nomad, and in most cases, drives the audience away. Pop art is different, because it cares about the audience.
  • A good deal of evidence supports the contention that people prefer stimuli with a medium degree of arousal potential and do not like stimuli with either an extremely high or low arousal potential. This relationship, described by the Wundt curve, is borne out by several general studies reviewed by Theodore Schneirla (1959) and Berlyne (1967) as well as by studies of aesthetic stimuli per se. The effect has been found with both literary (Evans 1969; Kamann 1963) and visual (Day 1967; Vitz 1966) stimuli. There is some question about the shape of the Wundt curve (Martindale et al. 1990), but there is no question that people do like some degree of intensity, complexity, and so on. (Page 42)
  • [Figure: Wundt curve]
    • Hedonic tone and arousal potential. According to Berlyne (1971), hedonic tone is related to the arousal potential of a stimulus by what is called the Wundt curve: stimuli with low arousal potential elicit indifference, stimuli with medium arousal potential elicit maximal pleasure, and stimuli with high levels of arousal potential elicit displeasure. (Page 42)
  • Habituation refers to the phenomenon whereby repetitions of a stimulus are accompanied by decreases in physiological reactivity to it. The psychological concomitant is becoming used to or bored with the stimulus. Habituation is not merely the polar opposite of need for novelty. Avoiding boredom is not necessarily the equivalent of approaching novelty. People who habituate quickly to stimuli do not necessarily have a high need for novelty (McClelland 1951). In fact, creative people like novelty but habituate more slowly than do uncreative people (Rosen, Moore, and Martindale 1983). Because of this fact and because habituation seems to be a universal property of nervous tissue (Thompson et al. 1979). (Page 45)
  • Ambiguity is a collative variable. Collative properties such as novelty or unpredictability can vary much more freely than meaning in all of the arts. One soon finds that to increase the arousal potential of aesthetic products over time, one must increase ambiguity, novelty, incongruity, and other collative properties. This is the reason for my theoretical emphasis on collative properties rather than upon other components of arousal potential. (Page 47)
  • Peak shift is a well-established behavioral phenomenon (Hanson 1959). Consider an animal that is rewarded if it responds to one stimulus (such as a 200-Hz tone) and not rewarded if it responds to another stimulus (a 180-Hz tone). After training, the animal will exhibit maximal responsivity at a point beyond which it was rewarded and in a direction away from the unrewarded stimulus (a 220-Hz tone). J.E.R. Staddon (1975) argues that peak shift serves as the force behind sexual selection in biological evolution. Because of peak shift, female birds that prefer to mate with males with bright rather than dull plumage will show even greater preference for males with supernormal or above-average brightness. As a result, such males will mate more often and produce more offspring. As a further result, and because peak shift operates during every generation, the brightness of male plumage in the species will increase across generations. (Page 47)
  • On the psychological level, habituation occurs gradually and the need for novelty is held in check by the peak-shift effect. Thus, an audience should reject not only works of art with insufficient arousal potential but also those having too much. Finally, the principle of least effort assures us that artists will increase arousal potential by the minimum amount needed to offset habituation. The opposing pressures should lead to gradual and orderly change in the arts. (Page 48)
  • Berlyne (1971) pointed out that the evolutionary theory has difficulty in explaining cases such as Egyptian art that show extremely slow rates of change. Yet, though much Egyptian painting was sealed in tombs, hardly a place to bring about speedy habituation, it did evolve as predicted by the theory (see pages 212-19). In general, the more an audience is exposed to a type of art, the faster the art should change. This assumption leads to specific predictions: we should find higher rates of change in living room furniture than in bedroom furniture, in everyday dress than in formal dress, and so on. (Page 52)
  • A simpler explanation is that clothing can lose some of its aesthetic qualities because of functional reasons. Fashion is not a high art. There is a pressure to increase arousal potential, but other pressures can add so much noise that the theory of aesthetic evolution is only marginally, rather than continually, applicable. Consumers can retard change by passive resistance. They can extinguish styles altogether by boycott. Walking sticks and hats are more or less extinct. The audience did this. This fact does not refute the theory of aesthetic evolution. Evolution can occur only when the environment permits it. If a politician kills all the poets, poetic evolution obviously ceases. If poets had to make a living by writing poetry, there would be no poetry; it would have gone the way of the scarlet waistcoat. On the other side of the coin, if the makers of scarlet waistcoats had not held to the silly notion that the only thing they are good for is to wear, there would still be plenty of them, and they would be quite fancy. Their functional aspect put all kinds of constraints on what they could look like. Recall Kubler’s definition that if something has a use, it is not art. If something has a use, people want it to work. That gets in the way of its aesthetic aspects. If something has a use, people can stop using it and destroy its aesthetic aspects altogether. Being useless has distinct advantages. Paintings don’t come with warranties. Customers can’t return them and say they don’t work, or that they have suddenly started to work. Customers can’t say anything, really. If something is useless, it doesn’t make any sense to say how long it will remain useless or to quibble about exactly how it should be useless. (Page 55)
  • The evolutionary theory can be construed in two ways. The weak version is that the theory explains a bit, but perhaps not much, about art history. The strong version is that the theory explains the main trends in art history. Because the main axes of artistic style (classic versus romantic, simple versus complex) are isomorphic with the main theoretical axes (conceptual versus primordial content, low versus high arousal potential), it is not unreasonable to think that the strong version may be true. (Page 69)
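
The Wundt curve from the page 42 notes above can be sketched numerically. One simple way to get the described shape (indifference at low arousal potential, maximal pleasure in the middle, displeasure at the high end) is to subtract a late-rising "aversion" curve from an early-rising "reward" curve. The logistic form and every parameter value below are my own assumptions for illustration, not formulas or numbers from the book.

```python
import numpy as np

def logistic(x, midpoint, steepness):
    """Standard logistic curve rising from 0 to 1 around `midpoint`."""
    return 1.0 / (1.0 + np.exp(-steepness * (x - midpoint)))

def wundt_curve(arousal, reward_mid=0.3, aversion_mid=0.7,
                aversion_weight=1.5, steepness=12.0):
    """Hedonic tone as reward minus a heavier, later-rising aversion term.

    All parameter values are illustrative assumptions.
    """
    reward = logistic(arousal, reward_mid, steepness)
    aversion = aversion_weight * logistic(arousal, aversion_mid, steepness)
    return reward - aversion

arousal = np.linspace(0.0, 1.0, 101)  # arousal potential on an arbitrary 0..1 scale
hedonic = wundt_curve(arousal)
peak = arousal[np.argmax(hedonic)]
print("hedonic tone peaks at arousal potential:", peak)
```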

Chapter 3. Crucible in a Tower of Ivory: Modern French Poetry

[Figure: Clockwork Muse, page 104]

  • We know that the truth must lie somewhere between these two extremes: that is, we know that poets cluster together into groups, for example, some tend to write lyrical poetry, whereas others write epic poetry. We want to know how many dimensions are needed to account for the similarities among poets. Fortunately, once we have correlated all of the poets with one another, a procedure called multidimensional scaling will tell us just this. Multidimensional scaling (Scikit-learn) tells us that the twenty-one French poets differ along three main dimensions. These three dimensions account for 94 percent of the similarity matrix. (Page 114)
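
To make the "how many dimensions" question concrete, here is a small sketch using classical (Torgerson) multidimensional scaling written directly in NumPy. This is a swapped-in technique: it is not the scikit-learn routine linked above, and I don't know which MDS variant Martindale used. The 21 "poets" are random hypothetical profiles built from three hidden dimensions, and `classical_mds` is a name made up for this example.

```python
import numpy as np

def classical_mds(dissimilarity, n_components):
    """Classical MDS: embed points from a matrix of pairwise dissimilarities.

    Returns the coordinates and the share of (non-negative) eigenvalue mass
    captured by the first `n_components` dimensions.
    """
    d = np.asarray(dissimilarity, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering      # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1]                # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    positive = np.clip(eigvals, 0.0, None)           # drop tiny negative round-off
    coords = eigvecs[:, :n_components] * np.sqrt(positive[:n_components])
    explained = positive[:n_components].sum() / positive.sum()
    return coords, explained

# Hypothetical data: 21 poets scored on 10 content categories, driven by 3 hidden dimensions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(21, 3))
profiles = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(21, 10))

corr = np.corrcoef(profiles)                                # poet-by-poet correlations
dissim = np.sqrt(np.clip(2.0 * (1.0 - corr), 0.0, None))    # correlation -> distance
coords, explained = classical_mds(dissim, n_components=3)
print("share accounted for by three dimensions:", explained)
```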

Chapter 4. Centuries of British Poetry

  • When a poetic tradition is first coalescing, poets are less sure of the rules of the game and of who else is playing. (Page 123)
    • I think this is the case for most if not all group consensus under uncertainty. I see this a lot in the D&D data.
  • Thus, legislators are really under pressure to produce new combinations of words, called laws, just as are poets. Suppose Parliament enacts a set of far-reaching new laws (initiates a new style). Subsequent Parliaments are going to have to pass laws that elaborate upon and refine the general laws. Eventually the result is going to be a set of overly specific, complicated, and contradictory laws. At this point, another stylistic change is in order: that is, a new set of general laws. These laws will also need refinement, so the whole process will begin anew. If this is at all close to capturing what legislative bodies do, then we might expect oscillations in primordial content because of oscillations in generality versus specificity of laws, an argument that cursory reading of the British statutes does support. General statutes are often followed by more concrete and specific ones that clarify the earlier statute or limit it so as to prevent unintended consequences. On the other side of the coin, a variety of specific statutes may eventually be replaced by an overarching general law. If we had a more appropriate measure of impact or complexity, we would almost certainly find that the complexity of law has increased across time. In an abstract sense, then, legal and poetic discourse are subject to evolutionary pressures that are isomorphic. (Page 131)

[Figure: Clockwork Muse, page 141]

  • The amount of primordial content in the two prior periods is negatively related to primordial content in a given period. The amount of stylistic change is positively related to stylistic change in the prior period and negatively related to stylistic change two periods earlier. These influences of the past on the present set up oscillations. (Page 141)
    • The delay in the connections may matter a lot. There may have to be a graph matrix term included in the chain. Behaviors would be very different for a fast, low weight link and a heavy weight, delayed link. Or is that a property in the reactivity of the node? Or both?
  • Primordial content shows a significant linear increase across all four periods. This linear trend accounts for 59 percent of the overall variability in primordial content. There is no sign of leveling off of this trend during the later periods. The pattern of results is most consistent with the hypothesis that the late metaphysical poets were caught in an evolutionary trap: They engaged in deeper and deeper regression, but did not achieve the desired increases in arousal potential. Once entrapped, the metaphysical style perished and was replaced by the ensuing neoclassical style. (Page 148)
    • I think this is a case where the “elevation” of the fitness landscape is sinking, but the practitioners are hillclimbing the wrong way.
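
The lagged influences in the page 141 note above read like a second-order autoregressive, or AR(2), process, so here is a small sketch of how such lags produce oscillation. The coefficients are illustrative assumptions chosen only to match the signs in the quote (positive lag 1 and negative lag 2 for stylistic change, both negative for primordial content); they are not Martindale's estimates.

```python
import numpy as np

def simulate_ar2(a1, a2, n_steps, noise_scale=0.0, x0=1.0, seed=None):
    """Simulate x[t] = a1 * x[t-1] + a2 * x[t-2] + noise, starting from x0."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n_steps)
    x[0] = x0
    for t in range(1, n_steps):
        prev2 = x[t - 2] if t >= 2 else 0.0
        noise = rng.normal(0.0, noise_scale) if noise_scale > 0 else 0.0
        x[t] = a1 * x[t - 1] + a2 * prev2 + noise
    return x

def oscillates(a1, a2):
    """True when the roots of z^2 - a1*z - a2 = 0 are complex (cyclic behavior)."""
    return a1 ** 2 + 4.0 * a2 < 0.0

# Signs follow the quote; magnitudes are made up.
stylistic = simulate_ar2(a1=0.5, a2=-0.6, n_steps=30)
primordial = simulate_ar2(a1=-0.3, a2=-0.5, n_steps=30)
sign_changes = int(np.sum(np.diff(np.sign(stylistic)) != 0))
print("stylistic series changes sign", sign_changes, "times in 30 periods")
```

With no noise, these particular values give a damped oscillation. Adding a little noise through `noise_scale` keeps the cycle going instead of letting it die out.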

Chapter 5. On American Shores: Poetry, Fiction, and Musical Lyrics

  • I have been talking, I know, in rarefied or abstract terms, our 170 poets ending by being represented by points in four- or five-dimensional hyperspaces. It is where they are in these abstract spaces that I have aimed to account for by four or five equations. The exciting thing is that poetry moves through these spaces in very orderly ways across the course of time and I think we can eventually capture the beauty of this sweep of history with quite simple and beautiful equations. Of course, this is hardly the usual way of talking about the history of poetry. However, I think that it adds to or complements, rather than in any way contradicting or negating, the more usual approaches. (Page 152)
  • To see how much of the overall variability in poetic language the evolutionary theory accounts for, I followed the same procedures that were used for French and English poetry. I first intercorrelated the poets’ profiles on the Harvard III categories to measure their relative similarity. Then, through multidimensional scaling, I found that four dimensions accounted for 94 percent of the variation among the poets. Correlating these dimensions with the content categories suggests that the dimensions are getting at lyric, narrative, and two sorts of emotional modes. We can account for 25 percent of variation in this space with our theoretical variables, a good bit less than the figure for French and British poetry. One reason for this discrepancy may be that the American series contains more minor poets (who may not really have a good sense of tradition) than do the French and British series. They may be perfectly fine poets, but their concerns may be tangential to those of poets writing in the focal American tradition. In fact, if we examine only the four most eminent poets for each period, 43 percent of the similarity among them can be accounted for by the theoretical variables, about the same figure as for French and British poetry. (Page 165)
    • American poets are/were more nomadic?
  • you can’t make money writing poetry, the poet has considerable freedom: nothing he or she writes is going to make money, so there is no sense in trying. Unfortunately, the case is different with fiction. You can make money with it, but only if a lot of people like it. Publishing firms publish poetry for prestige, but they want to make a profit with fiction. They resist publishing fiction that no one will buy. This situation puts something of a non-evolutionary pressure on the writer. Even worse, the writer may want to make money and thus write more for an external audience and less for other artists. Of course, the external audience habituates and wants novelty. However, it must habituate more slowly than artists because it is exposed less frequently to literature than they. The real problem with the external audience is that it may place all sorts of non-aesthetic pressures on literature. If the audience takes a puritanical turn, it won’t read pornography. If it gets obsessed with abolishing slavery, it wants Simon Legree stories. If the audience’s desires get really extreme, the environment will be distinctly unfavorable to aesthetic evolution. To use a biological analogy, sexual selection can proceed only in a benign environment. In an environment with a lot of predators, birds of paradise could not have evolved, for their brilliant plumage would have attracted those predators. (Page 169)

Chapter 6. Taking the Measure of the Visual Arts and Music

  • After establishing that subjects agreed in their ratings, I obtained a mean score for each painting on each of the variables by collapsing across subjects. The rating scales were subjected to factor analysis to see whether the scales were measuring, as hoped, a smaller number of underlying dimensions. Factor analysis (tutorial), which represents similarities among scales in a multidimensional space, indicated that the scales were really measuring five axes or dimensions. It was clear that two of these factors were, as hoped, getting at the theoretical variables. One factor, referred to as arousal potential, had high loadings on the scales Active, Complex, Tense, and Disorderly. Another, referred to as primordial content, had high loadings on Not Photographic, Not Representative of Reality, Otherworldly, and Unnatural. (Page 188)
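
As a rough illustration of "scales measuring a smaller number of underlying dimensions", the sketch below extracts principal components from the correlation matrix of some rating scales and counts how many have an eigenvalue above 1. This is a simpler stand-in for full factor analysis (no rotation, no unique-variance model) and is not Martindale's procedure. The ratings are synthetic: eight hypothetical scales driven by two hidden variables, with made-up noise levels.

```python
import numpy as np

def principal_loadings(ratings, n_factors):
    """Principal-component loadings from the correlation matrix of the columns.

    ratings: array of shape (n_items, n_scales). Returns (loadings, eigenvalues),
    with eigenvalues sorted from largest to smallest.
    """
    corr = np.corrcoef(np.asarray(ratings, dtype=float), rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loadings = eigvecs[:, :n_factors] * np.sqrt(eigvals[:n_factors])
    return loadings, eigvals

# Synthetic ratings: 200 hypothetical paintings, 4 scales per hidden variable.
rng = np.random.default_rng(0)
n_paintings = 200
arousal = rng.normal(size=n_paintings)
primordial = rng.normal(size=n_paintings)

def noisy_scale(hidden):
    """One rating scale: the hidden variable plus independent rater noise."""
    return hidden + 0.4 * rng.normal(size=n_paintings)

ratings = np.column_stack(
    [noisy_scale(arousal) for _ in range(4)]
    + [noisy_scale(primordial) for _ in range(4)]
)

loadings, eigvals = principal_loadings(ratings, n_factors=2)
n_retained = int(np.sum(eigvals > 1.0))  # eigenvalue-greater-than-one rule of thumb
print("components with eigenvalue > 1:", n_retained)
```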

Chapter 7. Cross-National, Cross-Genre, and Cross-Media Synchrony

  • This gives us a closed circle. When the longer series of French and British painting was considered, it was clear that French painting influences British painting more than vice versa. Now we see that American painting is predictable from British painting. Closing the circle, I found French poetry to be at least synchronously related to American poetry. Painting seems to be a much more international enterprise than poetry. At least for the last two centuries, British, French, and American painting appear to be closely related even when we take statistical precautions to avoid spurious findings. The Italian, British, and French series overlap for several periods. (Page 240)
  • According to Berlyne (1971), a work of art has three types of properties: psychophysical properties intrinsic to the stimulus (such as pitch, hue, intensity), collative properties (such as novelty, complexity, ambiguity), and ecological properties (such as denotative and connotative meaning). Berlyne held that these properties taken together determine the arousal potential of a work of art. Each of these aspects of a work of art offers several possibilities for cross-media correspondence. The psychologist S.S. Stevens (1975) showed that cross-modal matching on the intensity dimension is reliable. Another psychologist, Lawrence Marks (1975), showed that there are reliable consistencies in the matching of the brightness of colors and the “brightness” (defined by pitch and loudness) of sounds. Collative variables (such as complexity or novelty) are clearly defined across all media. Kenneth Burke (1957) argued that cross-media styles are based on two dimensions that Berlyne would call collative variables: unity and diversity. Meaning can also serve as the basis for cross-media styles, as is obvious in art forms having specific content, such as representational painting and sculpture. In terms of connotative meaning, Charles Osgood, George Suci, and Percy Tannenbaum’s (1957) three factors of evaluation, potency, and activity show up in a variety of rating tasks irrespective of what is being rated. Thus, connotative meaning could serve as a basis for equating works in different media. Finally, overall arousal potential or resultant liking (see figure 2.1, page 42) could serve as a basis for cross-media styles. I have argued that it makes sense to talk about primordial content in all of the arts. The experiment I did with Ross and Miller (1985) suggests that this cross-media effect is more than giving the same name to different things. In our experiment, people wrote fantasy stories in response to paintings. Primordial content in the stories, as measured by content analysis, was correlated with primordial content, as measured by rating scales, in the paintings. (Page 251)

[Figure: Clockwork Muse, page 254]

  • Thus, artistically untrained subjects do show a clear and significant tendency to perceive cross-media styles. Both baroque and neoclassical styles differ from the romantic style on the first dimension: that is, people see baroque and neoclassic styles as being similar to each other and dissimilar to the romantic style. Baroque and neoclassical styles do not differ significantly on this dimension. The second dimension differentiates baroque and neoclassical styles and, to a lesser extent, neoclassical and romantic styles. Baroque and romantic styles do not differ significantly on this dimension. Thus, people see baroque and romantic styles as similar to each other and dissimilar to the neoclassic style on this dimension. In summary, people perceive the three styles as varying in two fundamental ways, or along two dimensions. Unfortunately, multidimensional scaling does not tell us how to label these dimensions. (Page 255)

Chapter 8. Art and Society

  • Marxism ultimately attributes much social and cultural change to the emergence of new technology: technological innovations create new means of production; a class struggle ensues for control of these means of production. (page 264)

Chapter 9. The Artist and the Work of Art

  • I have argued that the effect of the pressure for novelty arises not so much from its strength as from its persistence. In the first part of this chapter, I investigate whether the pressure for novelty is strong enough to affect individual artists. Does the evolutionary theory help us understand only the great sweep of history, or can it shed light on stylistic development in the individual artist? Studies of creators as diverse as Beethoven and Grieg, Picasso and Rembrandt, and Dryden and Yeats suggest that the evolutionary theory is, indeed, relevant to the individual artist. At the end of the chapter, I report on some quantitative studies of nonevolutionary forces, such as temperament and psychopathology, that also shape the content of poets’ verse. In the middle part of the chapter, I turn quantitative methods upon individual works of art, ranging from prose through poetry to music. We find coherent trends across the course of literary narratives, trends that arise, in part, from a need to keep the reader’s attention but, in larger part, from what an author is trying to say. In fact, the type of trend we find sheds valuable light on what an author is, consciously or unconsciously, trying to say. (Page 284)
  • Aesthetic evolution shows a surprising self-similarity under magnification: when we examine the course of primordial content across centuries, we find long-term trends with superimposed oscillations. If we examine the course of primordial content across the career of an individual artist, we see exactly the same thing on a smaller time scale. Theoretically, these patterns are caused by the need to increase arousal potential. This need is caused by habituation: the audience-and the artist-tires of old forms and wants new ones. Habituation corresponds, as I have said, to building up a set of expectations. After these expectations are well developed, a work of art that conforms too closely to them will elicit neither interest nor pleasure. It is to avoid this sad fate that the artist must increase the arousal potential of his or her works. (Page 312)
  • On the psychological level, the theme of the journey to hell and back hypothetically symbolizes a regression from the conceptual (abstract, analytic, reality-oriented) thought of waking consciousness to primordial (concrete, free-associative, autistic) thought and then a return to conceptual thought. Of course, the psychoanalyst Ernst Kris (1952) holds that any act of creation involves an initial stage of inspiration and a subsequent stage of elaboration. In the inspirational phase, there is a regression toward primordial thought, whereas in the subsequent elaboration stage there is a return to analytic thinking. The inspirational stage yields the “rough draft” of the creative product, whereas the elaboration stage involves logical, analytical thought in putting the product into final form. Thus, the theme of the night journey mirrors the psychological processes involved in the creation of art. (Page 314)
  • If a narrative describing a journey to hell or a similar region symbolizes a regression from a conceptual to a primordial state of consciousness and a subsequent return to a conceptual state, we might expect that words indicating primordial content would first increase and then decrease in the narrative. (Page 315)
  • These two passages from page 333 describe earlier concepts of “coordinate frames” that we have used (in western society only?) to position imagination:
    • The names will be different, but the idea is the same. Since Galen’s theory of the temperaments was good enough for Kant, Wundt, and Pavlov, I’ll just use the old terms. The idea of temperament theory is that people differ along two dimensions: sanguine (optimistic and happy) versus melancholic (pessimistic and anxious) and choleric (active and quick to anger) versus phlegmatic (calm and passive). Galen connected the temperaments with bodily fluids which were seen in turn as expressions of the four elements: air versus earth and fire versus water. A nice theory, but it turned out to be wrong on the level of physiology and physics. Maybe, though, the connection between elements and temperaments is right on the psychic level.
    • At least in states of fantasy and reverie, there is a correspondence between temperaments and elements. As the literary critic Northrop Frye put it, “earth, air, water, and fire are still the four elements of imaginative experience, and always will be” (Bachelard 1964, p. vii [1938]).

Clock 335

Page 335

Chapter 10. Science and Art History

  • To test these predictions, I had eighteen subjects rate their liking for the Italian paintings shown in chronological order and seventeen subjects rate them when shown in reverse chronological order (Martindale 1986a). To disguise the real purpose of the experiments, the Like-Dislike scale was embedded in a set of seven other scales. Since subjects showed highly significant agreement in their preferences, a mean preference rating was computed for each painting in both experiments. As predicted, the correlation of liking with order was insignificant for the chronological group. On the other hand, it was strongly negative (much more so than when the paintings were shown in a random order) for the reverse-chronological group. (Page 341)
    • Good discussion on methods

Owl2Cat

Page 343

  • A really interesting element to consider is how the transformation from owl to cat differs from “morphing” to latent-space interpolation. Morphing is a dissolve that includes control points for an animated transition that typically involves input from users (e.g. the animator). Latent space interpolation lacks human involvement and creates images from that latent space that would probably be outside of human cognition:
    • BigGan
    • Humans seem to slowly generalise from a less familiar object until it resembles something more familiar, where detail is added up to a point, where it then varies.
  • Should we conclude from the changes in primordial content that the later drawings were made in more primordial states of mind? This is a possible but not necessary conclusion. The most parsimonious explanation would probably be that the later drawings are less realistic because of the cumulation of simplifying memory effects and lack of skill. It is reasonable, though, to expect variations in the degree to which thought is primordial even in a laboratory situation. Primordial cognition does not mean only extreme states such as dreams and delirium, which are in fact not conducive to artistic production. Our thoughts are always varying to a slight degree along the conceptual-primordial axis (Klinger 1978; Singer 1978). It is exactly such slight variations that I have been talking about whenever I have claimed that artists are creating in an altered state of consciousness. (Page 344)
  • The series of responses to the sentence stem “A table is like ___ ” shows remarkable parallelism with what has occurred in French poetry since 1800. In order, the responses were:
    1. the sea, quiet
    2. a horizontal wall
    3. a formicaed [sic] bed
    4. the platonic form
    5. a dead tree
    6. a listening board
    7. versatile friendship
    8. vanquished forest
    9. a seasoned man
    10. two chairs
  • The first three responses are based upon the flatness of the surface of a table. The fourth response mediates between these and the next two, which refer to the material composition of tables. All of these similes are clearly “appropriate,” suggesting a high level of stylistic elaboration: irrelevant responses were being filtered out. This filtering is discarded with the seventh response. The ninth and tenth responses clearly indicate a new, less elaborated “style.” The ninth response may be mediated by the word “seasoned,” which is transferred from wood to man: man and wood share a common attribute, although in a completely different sense; therefore they are compared. The tenth response is based upon the close associative connection between table and chair. Associative contiguity overrides objective dissimilarity–exactly as happened in twentieth-century French poetry. Presumably, subjects in the later trials could continue to increase originality only by lowering level of elaboration, by applying less stringent stylistic rules to their responses. (Page 347)
  • The task of a scientist is to produce ideas. That wouldn’t be hard if it weren’t for the constraints. One of the most severe of these is that the ideas have to be true. More exactly, the ideas cannot be contradicted by empirical evidence. Furthermore, this constraint cannot be evaded by producing ideas that cannot in principle be tested against reality. Scientific ideas have to be falsifiable (Popper 1959). They have to be susceptible of being shown to be incorrect. This selection pressure is common to all scientific disciplines. Scientists are in the position of poets before the twentieth century, who had to produce realistic similes. Why don’t scientists discard this constraint and produce surrealistic science? In fact, this is just what mathematics has done. In its early stages, mathematics was an empirical science. The constraint for realism has long since been abandoned, so that mathematicians produce “theories” that do not necessarily correspond to any empirical reality. (Page 350)
  • Can we make an analogous argument for science? I think that we can, but the pressure to increase arousal potential in science must lead in an opposite direction: toward less rather than more primordial thought. Broadly speaking, a poet’s task is to create ideas of the form “x is like y” where x and y are coordinate terms or concepts on the same level. Habituation of arousal potential forces successive poets to draw x and y from ever more distant domains. They seem to do this by engaging in ever more primordial thought over time. Scientists, on the other hand, have to produce statements of the form “x is related to y,” where x and y are on different levels. (Page 354)
  • These people formulate the general laws and develop the basic methods that define the paradigm. After the paradigm is established, normal science begins. Normal science is carried out by the members of what Crane calls an invisible college, a group of people who interact with each other and share the same goals and values. In the late stages of a paradigm, most major problems have been solved. This leads scientists to specialize on increasingly specific problems. This is necessary in order to extract the remaining ideas. In this stage, anomalies may also appear. In the final stage, there is exhaustion or crisis. The paradigm offers few possibilities and many problems or anomalies. As a consequence, members defect and are not replaced by new adherents. Very much the same thing happens in art and literature. A style gathers recruits who tend to know and interact with each other. There is excitement at first; but eventually the possibilities of the style are exhausted, and it becomes stale and decadent. Crane (1987) has analyzed the recent history of American painting in a manner analogous to her earlier analysis of science. (Page 358)
  • A paradigm can die for two reasons. There can be exhaustion without anomaly: that is, the paradigm has succeeded too well, and there are no problems left to be solved, as happened to Euclidean geometry. After the time of Euclid, there was nothing left to be done. The fit with reality had become perfect. Although Crane does not put it in these terms, we could say that arousal potential fell to zero, and the field elicited no further interest at least in the sense of scientists wanting to work in the area. In the case of exhaustion with anomaly, a different set of affairs exists. The exhaustion is not perceived as such. What is perceived is that a fit with reality can be obtained only with very complex theories. Further, new ideas can be generated only with considerable effort. In addition, there are undeniable anomalies or incongruities. Complexity, effort, and incongruity are likely to cause negative affect. In contrast, a new paradigm, if it is to be successful, will be more attractive: the fit with reality is not perfect, but neither are anomalies present as in an old paradigm. Anticipated payoff in relation to effort is much higher.
    The new paradigm usually wins by default. The old paradigm does not die. Its adherents die. Because they have not been able to recruit new disciples, the paradigm dies with them. Once this happens, the new paradigm becomes dominant, and the process begins anew. (Page 358)

An almost-from-scratch Python example of a simple neural network

Introduction

A rite of passage in understanding machine learning is writing your own network from scratch. This isn’t usually about making a better framework, it’s about figuring out what’s going on in all those frameworks. What follows is my contribution to this small but growing genre of programming literature.

The code is based on Andrew Trask’s Grokking Deep Learning (Github), which I’ll refer to as GDL below. No, I haven’t finished it, but this is the first major milestone, so I’m documenting it before I forget.

My background is that of a developer. I have been programming for a living since the 1980’s, mostly across the Object-Oriented landscape. I like classes and generalized solutions. It’s how I think, so this will be the framing that I use for this example.

There are two files: SimpleLayer.py, a class that handles the particulars of what a layer in a network needs to do, and simple_nn.py, a file that exercises that class by building a three-layer NN. We’ll walk through simple_nn.py first, which sets up and runs the network. Then we’ll walk through SimpleLayer, which handles training and backpropagation. At the bottom of the post are full code listings, which are also available on GitHub if you want to use this as a basis for experimentation.

simple_nn.py

Let’s start at the beginning:

import numpy as np
import matplotlib.pyplot as plt
import src.SimpleLayer as sl

One of the things that I try to do in this sort of exercise is to keep the number of libraries to a minimum. For this I use two very vanilla imports: NumPy for math, and Matplotlib for diagrams of the weights changing over time.

Next, some global variables:

# variables ------------------------------------------
# The samples. Columns are the things we're sampling, rows are the samples
streetlights_array = np.array( [[ 1, 0, 1 ],
[ 0, 1, 1 ],
[ 0, 0, 1 ],
[ 1, 1, 1 ]])
num_streetlights = len(streetlights_array[0])
num_samples = len(streetlights_array)

# The data set we want to map to. Each entry in the array matches the corresponding streetlights_array row
walk_vs_stop_array = np.array([[1],
[1],
[0],
[0]])

Here is all the data about the lights. Each row is a sample. Each element in the row is a light. There are four rows in this set. Each row is matched to its classification in the walk/stop array. These values come from GDL, where the premise is that you have a set of samples from three lights (streetlights_array), and a set of samples of actions that happen (walk_vs_stop_array).

lights

Figure 1: The lights and the behaviors from GDL

The goal is to train a network that maps the input streetlights to the right walk/stop output. In a real network, we’d worry about overfitting and other related issues, but we’re going to ignore that here.

The next two variables are not in GDL. These are layer_array, which will contain the instances of the SimpleLayer class, and error_plot_mat, which will be used by pyplot to draw a chart of the error converging to zero. Or failing to, as the case may be.

# set up the list that will store the SimpleLayer instances
layer_array = []

error_plot_mat = [] # for drawing plots

There is one last bit of setup before we start doing things. There are three methods that will be used later in the program:

# Methods ---------------------------------------------
# activation function: sets all negative numbers to zero
# Otherwise returns x
def relu(x: np.array) -> np.array :
    return (x > 0) * x

# This is the derivative of the above relu function, since the derivative of 1x is 1
def relu2deriv(output: np.array) -> np.array:
    return 1.0*(output > 0) # returns 1 for input > 0
    # return 0 otherwise

# create a layer
def create_layer(layer_name: str, neuron_count: int, target: sl.SimpleLayer = None) -> 'SimpleLayer':
    layer = sl.SimpleLayer(layer_name, neuron_count, relu, relu2deriv, target)
    layer_array.append(layer)
    return layer

Let’s go through these one at a time. First, I’d like to say that as someone who likes compiling, strong typing, and all those components that keep me from doing dumb things, I am a fan of Python’s accommodation of typing. It could be better, but it helps:

  • def relu(x: np.array) -> np.array : 
    • This is an example of an activation function. It will be used in the layer to determine whether or not a value propagates through the neuron. In this case, all it does is clamp negative values to zero.
  • def relu2deriv(output: np.array) -> np.array:
    • This is the derivative of the above function (not its inverse). It returns a one (the slope of the line in relu()) if the value is greater than zero, and zero otherwise.
  • def create_layer(layer_name: str, neuron_count: int, target: sl.SimpleLayer = None) -> 'SimpleLayer':
    • This is what I use to create a layer. You pass in the name for your layer, how many neurons it has, and its ‘target’, or the layer below it. One of the things that I discovered when writing the SimpleLayer class is how intimately layers are connected. In this case, building any other layer than the last requires a target layer. This allows the weights that will manage the influence between the neurons in each layer to be set up properly. It could have just as easily been built from top to bottom, and pointed at the ‘source’ layer.
    • The other thing that this method does is to store the newly created layer in the layer_array, which makes experimenting with adding and deleting layers trivial.
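
To make the two activation helpers concrete, here is a minimal sketch that applies them to a small hand-made array. The `sample` values are made up for illustration, and the type hints use `np.ndarray` (the array type) rather than `np.array` (the constructor function); otherwise the bodies match the listing above.

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    # Multiplying by the boolean mask zeroes out every non-positive entry
    return (x > 0) * x

def relu2deriv(output: np.ndarray) -> np.ndarray:
    # Slope of relu: 1.0 where the value is positive, 0.0 everywhere else
    return 1.0 * (output > 0)

# Illustrative values only: two negatives, a zero, and two positives
sample = np.array([[-2.0, -0.5, 0.0, 0.5, 2.0]])

print(relu(sample))        # negatives are clamped to zero (NumPy may display them as -0.)
print(relu2deriv(sample))  # ones only in the two positive positions
```

Note that the slope at exactly zero is treated as 0, which is a convention rather than a mathematical necessity.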

Ok, let’s set up the layers in our network! Again, this is a reimplementation of the network built in GDL (chapter 6).

np.random.seed(0)

#set up the layers from last to first, so that there is a target layer
output = create_layer("output", 1)
output.set_neurons([1])

hidden = create_layer("hidden", 4, output)
hidden.set_neurons([1, 2, 3, 4])

input = create_layer("input", 3, hidden)
input.set_neurons([1, 2, 3])

for layer in reversed(layer_array):
    print ("--------------")
    print (layer.to_string())

First, we seed the random generator with a value so that these results are repeatable. It turns out that this network can be made to converge very fast or not to converge simply by picking a different seed. We’ll discuss what this implies later.

So what we’ve created is a stack of layers, built from bottom to top that looks like this:

  • Input layer (3 neurons): This is where the streetlights information will be loaded
  • Hidden layer (4 neurons): This layer mediates the interactions between the input and output layers, making these interactions nonlinear. It’s what allows deep neural networks to learn nonlinear, discontinuous functions from examples. (And remember, that’s all that neural networks do. Though to be fair, it may be all our brains do, too…)
  • Output layer (1 neuron): This is where the walk/stop values will be used to adjust the output that is generated starting with the original, random weights.

We then print out the contents of each layer. One quick note – I load the neurons up with sequential integers (e.g. [[1. 2. 3.]]). These values get overwritten when the system is run, so it’s just a way to quickly verify the as-built neurons:

--------------
layer input: 
target = hidden
source = no source
neurons (row) = [[1. 2. 3.]]
weights (row) = 
[[-0.1526904   0.29178823 -0.12482558  0.783546  ]
 [ 0.92732552 -0.23311696  0.58345008  0.05778984]
 [ 0.13608912  0.85119328 -0.85792788 -0.8257414 ]]
--------------
layer hidden: 
target = output
source = input
neurons (row) = [[1. 2. 3. 4.]]
weights (row) = 
[[0.09762701]
 [0.43037873]
 [0.20552675]
 [0.08976637]]
--------------
layer output: 
target = no target
source = hidden
neurons (row) = [[1.]]
weights (row) = 
None

I think this output is pretty obvious, aside from the weights, so let’s look at them more closely. But first, a short digression.

When I see diagrams and descriptions of connected layers of neurons, they usually look something like this:

FullyConnected

Figure 2: The typical neural network diagram

As you can see, each neuron in the input layer is connected to each neuron in the hidden layer and so on through to the output layer. And that’s nice conceptually, but as a developer, I have no understanding of the mechanics of what’s happening. Here’s what it really looks like. For clarity, only the interactions involving input neuron 1 and hidden neuron 4 are shown, but the process is identical:

FullyConnectedWeights

Figure 3: How the input and hidden layer are actually connected

In this case, we’re looking at the mapping between the input layer and the hidden layer. Each neuron in the input layer gets its own row of weights, let’s say [0.1, 0.2, 0.0, 0.5] for neuron one. If that neuron is set to “10”, then a value of 1.0 will go to hidden neuron 1, a value of 2 to hidden neuron 2, a value of 0 to hidden neuron 3, and a value of 5 to hidden neuron 4. This process is repeated for each neuron in the input layer, and the value is added to the associated hidden neuron.

That’s what we mean when we talk about fully connected layers. Everything is mediated through an adjacency matrix of weights. We’ll revisit this in more detail when we walk through SimpleLayer. So in the listing above the two figures, the randomly initialized weights are organized so that each neuron in the layer has its own row. Each entry in that row is the scalar value that the row’s neuron value will be multiplied by as it is accumulated in the target’s neuron. Source neurons are the row component. Target neurons are the column component.
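
The row-per-source, column-per-target arrangement described above amounts to a single matrix product. The sketch below first reproduces the one-neuron example from the text, then extends it to three input neurons. The second and third weight rows, and the names `weight_mat` and `hidden`, are hypothetical values chosen only for illustration.

```python
import numpy as np

# The example from the text: one input neuron set to 10, with its own row of weights
input_row = np.array([[10.0]])               # shape (1, 1)
weights = np.array([[0.1, 0.2, 0.0, 0.5]])   # shape (1, 4): one row per source neuron
contributions = input_row @ weights          # what each hidden neuron receives
print(contributions)

# Three input neurons: each row belongs to one source neuron,
# each column accumulates into one target (hidden) neuron.
inputs = np.array([[1.0, 0.0, 1.0]])
weight_mat = np.array([[0.1, 0.2, 0.0, 0.5],   # row for input neuron 1 (from the text)
                       [0.3, 0.3, 0.3, 0.3],   # hypothetical row for input neuron 2
                       [0.4, 0.0, 0.1, 0.2]])  # hypothetical row for input neuron 3
hidden = inputs @ weight_mat                   # shape (1, 4)
print(hidden)
```

Because input neuron 2 is set to 0 here, its row contributes nothing, and `hidden` is just the sum of rows 1 and 3.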

Next is the body of the program:

alpha = 0.2
iter = 0
max_iter = 1000
epsilon = 0.001
error = 2 * epsilon
while error > epsilon:
    error = 0
    for sample_index in range(num_samples):
        input.set_neurons(streetlights_array[sample_index])
        for layer in reversed(layer_array):
            layer.train()

        delta = output.calc_delta(walk_vs_stop_array[sample_index])
        sample_error = np.sum(delta ** 2)
        error += sample_error

        for layer in layer_array:
            layer.learn(alpha)

        # Gather data for the plots
        error_plot_mat.append([sample_error])
        # print("{}.{} Error = {:.5f}".format(iter, sample_index, sample_error))

    error /= num_samples
    if (iter % 10) == 0 :
        print("{} Error = {:.5f}".format(iter, error))
    iter += 1
    # stop even if we don't converge
    if iter > max_iter:
        break

Let’s talk about the local variables first. The first variable, alpha, is the learning rate that we pass in. It’s a scalar that limits the step size in the change in weights. The bigger the scalar, the more likely it is to overshoot the goal and go into oscillation around it. The smaller the scalar, the longer the approach will take, but the greater the chance that it will stabilize. Like the seed we use to set the random number generator, the number of layers, and the number of neurons per layer, this is a hyperparameter. Some, like alpha, result in predictable behavior. Others, like seed, do not. There is a lot of this in deep learning, and you need to be careful about it. In particular, testing the resiliency of the solution by running it with a variety of values for the unpredictable hyperparameters, to see if the results are consistent, is probably a good idea, though it sucks up compute resources.
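
The effect of alpha is easiest to see on a problem much smaller than the network. The following is a minimal sketch, assuming a single weight, a single input, and a squared-error goal; the function name `descend` and all of the numbers are made up for illustration and do not appear in simple_nn.py.

```python
def descend(alpha, steps=20, x=2.0, goal=0.8, weight=0.0):
    """Gradient descent on one weight; returns the squared error before each update."""
    errors = []
    for _ in range(steps):
        pred = x * weight
        delta = pred - goal            # how far off the prediction is
        errors.append(delta ** 2)
        weight -= alpha * delta * x    # step size is scaled by alpha
    return errors

slow = descend(0.01)      # small alpha: steady but slow approach
steady = descend(0.1)     # moderate alpha: converges quickly
unstable = descend(0.6)   # large alpha: each step overshoots further than the last

for name, errs in [("0.01", slow), ("0.1", steady), ("0.6", unstable)]:
    print("alpha {}: first error {:.5f}, last error {:.5f}".format(name, errs[0], errs[-1]))
```

With these particular numbers each update multiplies the delta by (1 - alpha * x * x), so the threshold between settling and blowing up depends on the input as well as on alpha. That is one reason a learning rate that works for one data set may not work for another.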

The rest of the variables are used for loop control:

  • iter is the current count of times through the loop
  • max_iter is the maximum times we’ll run through the loop, even if we don’t converge
  • epsilon is the error threshold. If the error drops below that, we’re done.
  • error is the sum of the squares of the deltas for all the output neurons (in this case, one). We initialize it to a value that gets us into the loop

while error > epsilon:
    error = 0
    sample_error_array = []
    for sample_index in range(num_samples):
        input.set_neurons(streetlights_array[sample_index])

This is the main loop. First, we’re going to loop until our error is small. Error is computed by sample, so we need to know what the average (or max – I use average here) error is for each iteration. We also want to save the individual errors by sample for later plotting.

Within each loop, we’re going to evaluate the input streetlights sample against the output walk/stop sample. The first step in this process is to set the input neurons. This is where the input.set_neurons([1, 2, 3]) that we did when we were creating the layers gets overridden. In the training, the output from this layer will overwrite the values in the next layer and so on.

for layer in reversed(layer_array):
    layer.train()

This is the training step. We’ll go into more detail when we walk through SimpleLayer, but for now note that we step through all the layers from the top to the bottom, in reversed order from how they were created and loaded into layer_array.

delta = output.calc_delta(walk_vs_stop_array[sample_index])
sample_error = np.sum(delta ** 2)
error += sample_error

This is where we calculate the array of deltas that are the difference between the goal of the walk/stop array and the output neurons. The error is the sum of the squares of all those deltas. Sum of squares is nice because it’s always positive.

for layer in layer_array:
    layer.learn(alpha)

# Gather data for the plots
sample_error_array.append(sample_error)
# print("{}.{} Error = {:.5f}".format(iter, sample_index, sample_error))

Learning is done from bottom to top, using the deltas stored in the output layer. These are backpropagated through the layers, and the changes in the weights are scaled to 20% of the calculated values so we settle nicely.
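
For comparison, here is the same forward-then-backward cycle written as flat NumPy, in the style of the GDL chapter 6 network rather than with SimpleLayer. This is a sketch under assumptions: the variable names (`weights_0_1`, `layer_1_delta`, and so on) are GDL-style and are not taken from simple_nn.py, the seed is GDL’s rather than the one used above, and the output layer is left linear, which may or may not match what SimpleLayer does internally.

```python
import numpy as np

np.random.seed(1)  # GDL's seed (assumption); other seeds may behave very differently

def relu(x):
    return (x > 0) * x

def relu2deriv(output):
    return 1.0 * (output > 0)

streetlights = np.array([[1, 0, 1], [0, 1, 1], [0, 0, 1], [1, 1, 1]])
walk_vs_stop = np.array([[1], [1], [0], [0]])

alpha = 0.2
weights_0_1 = 2 * np.random.random((3, 4)) - 1   # input -> hidden
weights_1_2 = 2 * np.random.random((4, 1)) - 1   # hidden -> output

errors = []
for iteration in range(60):
    total_error = 0.0
    for i in range(len(streetlights)):
        # Forward pass: input -> hidden (with relu) -> output
        layer_0 = streetlights[i:i + 1]
        layer_1 = relu(layer_0.dot(weights_0_1))
        layer_2 = layer_1.dot(weights_1_2)

        # Delta at the output, then pushed back through the hidden layer
        layer_2_delta = layer_2 - walk_vs_stop[i:i + 1]
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
        total_error += np.sum(layer_2_delta ** 2)

        # Weight changes are scaled by alpha before being applied
        weights_1_2 -= alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 -= alpha * layer_0.T.dot(layer_1_delta)
    errors.append(total_error / len(streetlights))

print("first pass: {:.5f}, last pass: {:.5f}".format(errors[0], errors[-1]))
```

The hidden-layer delta is the output delta sent backwards through the same weights that carried the signal forwards, masked by the relu slope so that neurons that were switched off receive no blame.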

We also gather the error data (for each streetlight-walk/stop sample) into a matrix that we can print out when we’re done. If we want to, we can print the error for each sample in the training. Some converge faster than others, but this is not the best way to see that.

error /= num_samples
if (iter % 10) == 0 :
    print("{} Error = {:.5f}".format(iter, error))
iter += 1
# stop even if we don't converge
if iter > max_iter:
    break

At the bottom of the loop, we calculate the average error over all the samples. We then see if we’ve been here too long and break if we have, regardless of whether we’ve converged or not. And lastly, this is how I like to print formatted strings in Python (essentially the same as “%.5f” in Java/C/etc).

Once the loop terminates, we need to see how well the network has learned. As I said earlier, in a real machine learning situation we would be careful about issues such as overfitting by, for example, training against one set of data and testing against another. But since this is a toy problem, we are simply going to see how it did with the training data. I’ve added some explicit variables for clarity:

  • prediction: The contents of the single neuron in the output layer
  • observed: The value in the walk/stop array that we’re evaluating against
  • accuracy: how close did we get?

print("\n--------------evaluation")
for sample_index in range(len(streetlights_array)):
    input.set_neurons(streetlights_array[sample_index])
    for layer in reversed(layer_array):
        layer.train()
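    # .item() pulls the single value out of a one-element array;
    # calling float() directly on such an array is deprecated in recent NumPy releases
    prediction = float(output.neuron_row_array.item())
    observed = float(walk_vs_stop_array[sample_index].item())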
    accuracy = 1.0 - abs(prediction - observed)
    print("sample {} - input: {} = pred: {:.3f} vs. actual:{} ({:.2f}% accuracy)".
          format(sample_index, input.neuron_row_array, prediction, observed, accuracy*100.0))

Since the network is already set up with weights, all we need to do is to see how well our inputs match our outputs. All this means is taking a set of inputs and running them forward through the model. There will be no learning via backpropagation.

So let’s see how we did!

0 Error = 0.35238
10 Error = 0.29001
20 Error = 0.19074
30 Error = 0.12883
40 Error = 0.04666
50 Error = 0.00544

--------------evaluation
sample 0 - input: [[1. 0. 1.]] = pred: 0.978 vs. actual:1.0 (97.78% accuracy)
sample 1 - input: [[0. 1. 1.]] = pred: 1.000 vs. actual:1.0 (100.00% accuracy)
sample 2 - input: [[0. 0. 1.]] = pred: 0.037 vs. actual:0.0 (96.27% accuracy)
sample 3 - input: [[1. 1. 1.]] = pred: 0.000 vs. actual:0.0 (99.95% accuracy)

As you can see, the values converge in fewer than 60 iterations, and the predictions are quite close. For the second and fourth streetlight patterns, the results are basically exact (100% and 99.95%). That’s not bad for a bunch of random numbers and two simple rules.

These are the kinds of outputs that you get with heavyweight packages like Keras. It’s helpful (We trained successfully! Hooray!). And these types of outputs make sense when models are huge – or even for bigger toy problems like MNIST (which we will explore in a future post).

But this is toy code for a toy problem so we can show more than that. Being able to visualize what’s going on is very helpful. That’s why the error for each step has been saved in error_plot_mat.

Plotting data like this in Python is one of the joys of using the language. Here’s what it takes:

# plots ----------------------------------------------
fig_num = 1
f1 = plt.figure(fig_num)
plt.plot(error_plot_mat)
plt.title("error")
names = []
for i in range(num_samples):
    names.append("sample_{}".format(i))
names.append("average")
plt.legend(names)

for layer in reversed(layer_array):
    if layer.target is not None:
        fig_num += 1
        layer.plot_weight_matrix(fig_num)

for layer in reversed(layer_array):
    fig_num += 1
    layer.plot_neuron_matrix(fig_num)

plt.show()

We are going to be creating a bunch of plots. One for the error, and then one for each set of neurons and their weights. We’ll get back to the layer plots when we’re walking through SimpleLayer, but here’s a plot of all the errors, by sample and average for the entire training session:

outputerror

Figure 4: Error for each sample

A few things are worth noting. First, this is not a linear process. There are times when the learning process is pretty slow, particularly at the beginning in this example. Second, zero error happens much sooner for some samples than others. The first sample with zero error happens around step 150 (iteration 37 or so of the main loop). If the exit condition were based on looking at one sample instead of the average of all the sample errors, the system could exit early. I had this happen when I was using sample_error rather than error in the exit condition. It took a while to figure out why some seed values behaved so differently from others….
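
The early-exit pitfall is easy to reproduce with a handful of numbers. This is a minimal sketch using hypothetical per-sample errors (they are not taken from the run plotted above); it compares what a loop test on the last sample’s error would see against what a test on the average would see.

```python
import numpy as np

# Hypothetical squared errors for the four samples in one pass (illustrative only)
sample_errors = np.array([0.0004, 0.0900, 0.0300, 0.0002])
epsilon = 0.001

last_sample_error = sample_errors[-1]   # what `while sample_error > epsilon` would test
average_error = sample_errors.mean()    # what `while error > epsilon` tests

print(last_sample_error > epsilon)  # False: a loop keyed on one sample would stop here
print(average_error > epsilon)      # True: the network as a whole has not converged
```

Using the maximum of the per-sample errors instead of the mean would be an even stricter exit condition, at the cost of more iterations.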

And that ends the tour of the main loop. Next, we’ll look at how layers interact to train and learn.

SimpleLayer

The previous section is roughly equivalent to using Keras, Torch, or another machine learning framework. You get an idea of the behavior of a system and how the construction affects the output, but the details of the implementation are hidden. In this section, we’re going to look at the creation of a layer in detail – the ways they are connected and the ways that they communicate. As with the walkthrough of the main loop, we’ll start with the construction of the layer, then the forward training process, then the backpropagation learning process, and finally graph what’s going on.

Construction

As with simple_nn.py, SimpleLayer is written to have very few dependencies. I actually struggled with whether or not to write my own matrix math, but I think NumPy is pretty clear, and it would get distracting with all the additional code.

import numpy as np
import matplotlib.pyplot as plt
import types
import typing

There are some class-wide variables that we should describe:

class SimpleLayer:
    name = "unset"
    neuron_row_array = None
    neuron_col_array = None
    weight_row_mat = None
    weight_col_mat = None
    weight_history_mat = [] # for drawing plots
    neuron_history_mat = [] # for drawing plots
    num_neurons = 0
    delta = 0 # the 'movement' scalar
    target = None
    source = None
    activation_func = None
    derivative_func = None

In order of declaration, these are

  • name: the string name of the layer. Used in printing and surprisingly useful in debugging
  • neuron_row_array: the neurons in row form (i.e. [[n1, n2, n3, … , nN]])
  • neuron_col_array: the transpose of neuron_row_array (i.e. [[n1], [n2], [n3], … ,[nN]]). We need the data in both forms for interactions between layers
  • weight_row_mat: the weights in row format, as above
  • weight_col_mat: the weights in column format, as above
  • weight_history_mat: where the weight data from each training pass is stored for plotting
  • neuron_history_mat: where the neuron data from each training pass is stored for plotting
  • num_neurons: the number of neurons in this layer
  • delta: the scalar that changes the size of the “step” this layer takes as it tries to converge on the goal. Passed in as alpha in simple_nn.py
  • target: the layer “below” this layer. May be None
  • source: the layer “above” this layer. May be None
  • activation_func: the function that controls the nonlinearity of the training process. Passed in as relu() from simple_nn 
  • derivative_func: the function used in backpropagation that is the derivative of the activation function. Passed in as relu2deriv() in simple_nn

Next is the initialization, which is done through the constructor:

# set up the layer with the number of neurons, the next layer in the sequence, and the activation/backprop functions
def __init__(self, name, num_neurons: int, activation_ptr: types.FunctionType, deriv_ptr: types.FunctionType, target: 'SimpleLayer' = None):
    self.reset()
    self.activation_func = activation_ptr
    self.derivative_func = deriv_ptr
    self.name = name
    self.num_neurons = num_neurons
    self.neuron_row_array = np.zeros((1, num_neurons))
    self.neuron_col_array = np.zeros((num_neurons, 1))
    # We only have weights if there is another layer below us
    for i in range(num_neurons):
        self.neuron_history_mat.append([])
    if(target != None):
        self.target = target
        target.source = self
        self.weight_row_mat = 2 * np.random.random((num_neurons, target.num_neurons)) - 1
        self.weight_col_mat = self.weight_row_mat.T

This takes the values supplied in the create_layer() method in simple_nn.py and builds the layer. Once the local variables are set, the matrices of neurons are created.

If there is a target, the two layers are connected. What this means is that the source layer creates a numpy matrix that has as many rows as the source neurons and as many columns as the target neurons (See figure 3). This matrix is the weights that are used to uniquely distribute the value of each neuron in the source layer to each neuron in the target layer. As with the neurons, this is stored in row and column form.
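As a rough illustration of that connection step, here is a minimal standalone sketch, assuming a three-neuron source layer and a four-neuron target layer. The names num_source, num_target and rng are made up for this example and are not part of SimpleLayer.

```python
import numpy as np

# Hypothetical sizes for this sketch: 3 source neurons feeding 4 target neurons
num_source = 3
num_target = 4

# Seeded generator so the sketch is repeatable (an assumption, not the post's setup)
rng = np.random.default_rng(0)

# One row per source neuron, one column per target neuron, values in [-1, 1)
weight_row_mat = 2 * rng.random((num_source, num_target)) - 1
weight_col_mat = weight_row_mat.T

print(weight_row_mat.shape)  # (3, 4)
print(weight_col_mat.shape)  # (4, 3)

# weight_row_mat[s][t] is the weight from source neuron s to target neuron t
source_neurons = np.array([[1.0, 0.0, 1.0]])
target_neurons = np.dot(source_neurons, weight_row_mat)
print(target_neurons.shape)  # (1, 4)
```

Because the second source neuron is zero here, each target value is just the sum of the first and third rows of the weight matrix.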

Once each layer is set up, we are ready to begin the training process.

Training

Training a neural network is the process of taking a set of input values and sending them through the entire network to get an output. We can compare that output to the desired value, and then adjust. Using the mechanism of a deep neural network allows us to build a system that can map many input values to a desired output value. In this case, we’re looking at three values in an array, but using exactly the same structure, we can increase the number of values to be the pixels in an image and the output to be the label for that image:

Figure 5: The CIFAR-10 Dataset

That takes more layers and some other tricks, but the basic technique is the same.

Ok, back to three values in an array that represent some streetlights. To get this into the input layer, we use the set_neurons() method:

# Fill neurons with values
def set_neurons(self, val_list: typing.List):
    # print("cur = {}, input = {}".format(self.neuron_array, val_list))
    for i in range(0, len(val_list)):
        self.neuron_row_array[0][i] = val_list[i]
    self.neuron_col_array = self.neuron_row_array.T

The NumPy neuron arrays are actually two-dimensional arrays that are one element deep, which supports NumPy array math like dot product and transpose. That’s the reason for the awkward indexing when we copy val_list into the neurons. We then take the transpose immediately so that I don’t have to wonder if it’s been done already.
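To see why the extra dimension matters, here is a small sketch comparing a flat array with the row and column forms. The variable names are illustrative only.

```python
import numpy as np

flat = np.array([1.0, 0.0, 1.0])   # shape (3,)
row = np.zeros((1, 3))             # shape (1, 3), like neuron_row_array
row[0][:] = [1.0, 0.0, 1.0]
col = row.T                        # shape (3, 1), like neuron_col_array

# Transposing a 1-D array does nothing, so it cannot give us a column
print(flat.T.shape)  # (3,)
print(col.shape)     # (3, 1)

# With explicit row and column forms, both products are well defined
inner = np.dot(row, col)   # (1, 3) . (3, 1) -> (1, 1)
outer = np.dot(col, row)   # (3, 1) . (1, 3) -> (3, 3)
print(inner.shape, outer.shape)
```

The (3, 1) by (1, 3) product is the same shape of operation that learn() later uses to build the weight adjustment matrix.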

The next step is to ripple the values through the network layers:

def train(self):
    # if not the bottom layer, we can record values for plotting
    if(self.target != None):
        self.weight_history_mat.append(self.nparray_to_list(self.weight_row_mat))

    # if we're not the top layer, propagate weights
    if self.source != None:
        src = self.source
        # set our neuron values as the dot product of the source neurons, and the source weights
        self.neuron_row_array = np.dot(src.neuron_row_array, src.weight_row_mat)

        # No activation function to output layer
        if(self.target != None):
            # Adjust the values based on the activation function. This introduces nonlinearity.
            # For example, the relu function clamps all negative values to zero
            self.neuron_row_array = self.activation_func(self.neuron_row_array)

        # Transpose the neuron array and save for learn()
        self.neuron_col_array = self.neuron_row_array.T

    # record values for plotting
    for i in range(self.num_neurons):
        self.neuron_history_mat[i].append(self.neuron_row_array[0][i])

We start to see how intimately the layers are connected in this method. We look to the target and source layers to adjust our behaviors and set values.

Since this is the top layer, we have no source. That means that we record our weights for later plotting and we’re done. The layer below us will set its neurons based on this layer’s weights and neurons, as handled in this line:

self.neuron_row_array = np.dot(src.neuron_row_array, src.weight_row_mat)

This is just the first step. If we’re not the bottom layer, we have to see if the neuron values make it past the activation function that we set in simple_nn.py:

# activation function: sets all negative numbers to zero
# Otherwise returns x
def relu(x: np.array) -> np.array :
    return (x > 0) * x

This is done with these lines:

# No activation function to output layer
if(self.target != None):
    # Adjust the values based on the activation function. This introduces nonlinearity.
    # For example, the relu function clamps all negative values to zero
    self.neuron_row_array = self.activation_func(self.neuron_row_array)

By running these same methods on each successive layer object, the streetlight values are gradually and nonlinearly modified to produce a single output (the nonlinearity is critical in multi-layer networks). Unfortunately, that output is guaranteed to be wrong, since it’s based on multiplying the input values by the random values that each layer’s weights were initialized with.
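Stripped of the layer objects, the forward pass described above might look like the following sketch. It assumes a 3-4-1 network and uses made-up variable names (w_in_hidden, w_hidden_out) rather than the SimpleLayer fields.

```python
import numpy as np

def relu(x):
    # clamp negative values to zero, pass everything else through
    return (x > 0) * x

rng = np.random.default_rng(1)  # seeded for repeatability (an assumption)

input_row = np.array([[1.0, 0.0, 1.0]])    # one streetlight sample
w_in_hidden = 2 * rng.random((3, 4)) - 1   # input -> hidden weights
w_hidden_out = 2 * rng.random((4, 1)) - 1  # hidden -> output weights

# Hidden layer: dot product with the source weights, then the activation function
hidden_row = relu(np.dot(input_row, w_in_hidden))

# Output layer: dot product only, no activation function
output_row = np.dot(hidden_row, w_hidden_out)

print(hidden_row.shape, output_row.shape)
```

With untrained random weights, output_row is just whatever the random numbers happen to produce.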

Time to fix that.

Learning

Back in simple_nn.py, between the train() and the learn() loops is this line:

delta = output.calc_delta(walk_vs_stop_array[sample_index])

The delta saves out the error for the plotting. The function sets up the values for the learning step:

def calc_delta(self, goal: np.array) -> np.ndarray:
    self.delta = goal - self.neuron_row_array
    return self.delta

self.delta is a NumPy array that stores the difference between the goal(s) and the current value. In this case, there is only one value, but this also works with multiple values. That’s another trick that gets used in training networks. For example, in handling the CIFAR images, there is an output neuron for each category (e.g. horse, automobile, truck, ship, etc.). In our toy example and in the CIFAR case, the goal is a one or zero in the output neuron(s). The delta is the difference between the computed value and the goal. That delta is what we will now backpropagate through the layers, from back to front. And that’s the learning process.
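As a sketch of the multiple-value case, here is the same subtraction applied to a hypothetical three-category output. The numbers are invented for illustration, and calc_delta is written as a free function rather than a method.

```python
import numpy as np

def calc_delta(goal, neuron_row_array):
    # difference between where we want to be and where we are
    return goal - neuron_row_array

# Hypothetical output layer with one neuron per category
output_row = np.array([[0.2, 0.7, 0.1]])
goal = np.array([0, 1, 0])  # the second category is the right answer

delta = calc_delta(goal, output_row)
sample_error = np.sum(delta ** 2)

print(delta.shape)  # (1, 3)
```

The squared sum is the same per-sample error that simple_nn.py accumulates for the error plot.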

In learning, the basic goal is to adjust the weights that set this layer’s neurons (in this implementation, those weights live in the source layer). This is done by backpropagating the error delta from this layer to the source layer. Since we only want to adjust the weights that participated in the training, we need to take the derivative of the activation function used in train(). Again, the weight matrix has one entry for every pairing of a source neuron with one of this layer’s neurons. For example, if the source layer had three neurons and this layer had four, then the (source) weight matrix would hold 3*4 = 12 weights. The whole method is shown below.

def learn(self, alpha):
    if self.source != None:
        src = self.source
        delta_scalar = np.dot(self.delta, src.weight_col_mat)
        delta_threshold = self.derivative_func(src.neuron_row_array)
        src.delta = delta_scalar * delta_threshold
        mat = np.dot(src.neuron_col_array, self.delta)
        src.weight_row_mat += alpha * mat
        src.weight_col_mat = src.weight_row_mat.T

There’s a lot going on here, so let’s go through it slowly:

def learn(self, alpha):
    # if there is a layer above us
    if self.source != None:
        src = self.source

Since weights exist between neurons, we once more have the intimate relationship between this layer’s neurons and the layer above this layer. If there is no layer above us, there is literally nothing to do, which is why this test is first.

delta_scalar = np.dot(self.delta, src.weight_col_mat)

Next, we calculate the error delta scalar array, which is the amount the source layer needs to change (set initially in the output layer’s calc_delta(), then rippled up through the layers), multiplied across the weights used to set this layer’s neurons (in the source).

delta_threshold = self.derivative_func(src.neuron_row_array)

In the train() process, we distributed the values in a non-linear way – any neuron value below zero was not distributed (the relu() function from simple_nn.py). That process needs to be mirrored in the backpropagation process. There is always a matched pair of methods that make the core of a neural network – the activation function, and the derivative function.
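To make the matched pair concrete, here is a small sketch using the relu() and relu2deriv() definitions from simple_nn.py on made-up values. The names pre_activation and incoming_error are illustrative only.

```python
import numpy as np

def relu(x):
    return (x > 0) * x

def relu2deriv(output):
    return 1.0 * (output > 0)

# Hypothetical values arriving at a four-neuron hidden layer
pre_activation = np.array([[-0.5, 0.0, 0.3, 2.0]])

activated = relu(pre_activation)   # negatives are clamped to zero
gate = relu2deriv(activated)       # 1 where the neuron fired, 0 where it did not

# Hypothetical error coming back from the layer below
incoming_error = np.array([[0.4, -0.2, 0.1, -0.3]])
gated_error = incoming_error * gate  # only neurons that fired receive any error

print(gate)
```

Neurons that were switched off on the way down contribute nothing on the way back up, which is the mirroring described above.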

src.delta = delta_scalar * delta_threshold

This is where the actual change for the source layer is calculated. It’s the product of the delta_scalar and the delta_threshold that we’ve just calculated: the on/off decision of derivative_func() gates the error coming back from this layer. (The alpha value that we pass in from simple_nn.py is applied later, when the weights are updated.) This value will be used when the learn() method is called for the source layer. Like I said, layers are intimately connected.

We now take the self.delta that was calculated in our target layer’s learn() method, and use it to adjust the weights in the source layer that will be used to set our neurons’ values on the next train() pass.

mat = np.dot(src.neuron_col_array, self.delta)
src.weight_row_mat += alpha * mat
src.weight_col_mat = src.weight_row_mat.T

This matrix (mat) contains the adjustments for the source layer’s weights. We want to add only a fraction of these values (or we won’t converge), so we multiply by alpha. The last step is simply storing the transpose of the updated weight matrix.
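Here is one learn() step worked through with small, hand-picked numbers so the shapes and values can be checked by eye. It is only a sketch: the hidden and output values are invented, and the variables stand in for the SimpleLayer fields named in the comments.

```python
import numpy as np

alpha = 0.2

# Source (hidden) layer state after train(): four neurons, one of them switched off
hidden_row = np.array([[0.5, 0.0, 1.0, 0.25]])           # src.neuron_row_array
hidden_col = hidden_row.T                                # src.neuron_col_array
w_hidden_out = np.array([[0.1], [-0.2], [0.3], [0.4]])   # src.weight_row_mat, shape (4, 1)

# This (output) layer's delta, as calc_delta() would produce it
output_delta = np.array([[0.5]])                         # self.delta

# Error spread back across the weights, then gated by the relu derivative
delta_scalar = np.dot(output_delta, w_hidden_out.T)      # shape (1, 4)
delta_threshold = 1.0 * (hidden_row > 0)                 # [[1, 0, 1, 1]]
hidden_delta = delta_scalar * delta_threshold            # src.delta

# Weight adjustment: source neurons (as a column) times this layer's delta
mat = np.dot(hidden_col, output_delta)                   # shape (4, 1)
w_updated = w_hidden_out + alpha * mat

print(hidden_delta)
print(w_updated.T)
```

The weight attached to the switched-off neuron does not move, because its neuron value is zero.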

And that’s pretty much the guts of this implementation. The important things to remember are:

  • Input and output layers are special cases. The neurons are explicitly set in the input layer and there is no activation or derivative function applied to the output neurons (no, I don’t know why yet. When I figure that out, I’ll explain why here)
  • In training, the current layer’s neuron values are set by multiplying the source neurons by the source weights.
  • In learning, the source layer’s weights are adjusted by the current layer’s deltas, but thresholded by the derivative of the source layer’s neurons

This is pretty complicated, and I’ve split out the steps so that it’s possible to step through the running code in the debugger and see what’s going on with the values. But that only gives a level of insight at a single step. How can we show the global behavior of a layer?

Graphing

We are going to graph both the changing value of the neurons and the evolving weights. The neurons are an easier problem so we’ll start there:

def plot_neuron_matrix(self, fig_num: int):
    title = "{} neuron history".format(self.name)
    plt.figure(fig_num)
    np_mat = np.array(self.neuron_history_mat)

    plt.plot(np_mat.T, '-o', linestyle=' ', ms=2)

    names = []
    for i in range(self.num_neurons):
        names.append("neuron {}".format(i))
    plt.legend(names)
    plt.title(title)

This method simply takes the history matrix (one row per neuron, one column per time sample), turns it into a NumPy array for easier manipulation and plotting, and plots the transpose, because PyPlot draws each column of a 2D array as its own series. Because the neurons’ values change for each sample, the history of how they converge towards the final values doesn’t show up well with lines, so I set the drawing arguments to points (‘-o’), no line (linestyle=‘ ’), with a marker size of 2 points (ms=2):

Figure 6: Neuron Histories by layer

In the input layer, we see that the neurons are either one or zero, just as we set them. These values are then multiplied by the (initially random) weights and further adjusted by the activation function. Those values ripple through the hidden layer, where they are initially random over the (0, 1) interval, and are then used to set the output neurons (which does not involve an activation function). Over time, you can see the system settle into a state where all neurons are either one or zero, depending on the inputs. So how do the weights achieve this?

Even in this toy system, there are still a lot of weights to keep track of, and I’m still working on a way of visualizing the process. I’m visualizing the weights instead of the neurons, because the weights are the “factors in the equation” that manipulate the “x” values to get a “y”. In other words, I’m watching how the “m” and “b” converge on their values in “y = mx + b”, rather than looking at a particular “x” or “y”.

The method that does this is plot_weight_matrix(), which assembles a chart for each set of weights and is called at the end of simple_nn.py:

def plot_weight_matrix(self, fig_num: int):
    var_name = "weight"
    title = "{} to {} {}".format(self.name, self.target.name, var_name)
    plt.figure(fig_num)
    np_mat = np.array(self.weight_history_mat)

    # the weights were flattened row by row, so flat index i = source * num_targets + target
    num_targets = self.target.num_neurons
    names = []
    for i, row in enumerate(np_mat.T):
        src_n = i // num_targets
        targ_n = i % num_targets
        # color identifies the source neuron, line width the target neuron
        plt.plot(row, linewidth=targ_n + 1, color="C{}".format(src_n))
        names.append("{} s:t[{}:{}]".format(var_name, src_n, targ_n))
    plt.legend(names)
    plt.title(title)

One of the reasons that I really like OO programming is that so much useful data is associated with the object. You don’t have to go looking for it, or scope things in peculiar ways. As a result, for example, generating the title is simply assembling some strings that I already have lying around.

title = "{} to {} {}".format(self.name, self.target.name, var_name)

The next important step is to get the data that we’ve been assembling in train() into a form that the plotting library likes. The data has been assembled in a list of lists, where each individual list is a snapshot of the weights at one step in the training process. I do it this way for two reasons:

  1. I don’t know how many steps this process is going to take, and Python lists handle dynamic memory allocation nicely.
  2. The weight matrix is a 2D NumPy array, and dealing with a series of matrices is something that PyPlot has no idea how to handle.

Here’s the line from train():

if(self.target != None):
    self.weight_history_mat.append(self.nparray_to_list(self.weight_row_mat))

PyPlot doesn’t really like to handle lists of lists, but it does know how to handle one big NumPy array, so we convert the list of lists to a matrix where the rows are the timesteps and the columns are the weights:

np_mat = np.array(self.weight_history_mat)
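A minimal sketch of that bookkeeping, using a made-up 2x2 weight matrix that is nudged by 0.1 at each step in place of real training:

```python
import numpy as np

weight_history = []                           # stands in for weight_history_mat
weights = np.array([[0.1, 0.2], [0.3, 0.4]])  # hypothetical 2x2 weight matrix

for step in range(3):
    # flatten the current weights into a plain Python list, as nparray_to_list() does
    weight_history.append([float(x) for x in np.nditer(weights)])
    weights += 0.1  # pretend this is a learning step

np_mat = np.array(weight_history)
print(np_mat.shape)   # (3, 4): one row per snapshot, one column per weight
print(np_mat[:, 0])   # the history of the first weight
```

Each column of np_mat is the full history of a single weight, which is the unit that gets drawn as one line.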

At this point we could simply plot everything:

plt.plot(np_mat)

That produces a pretty chart:

Figure 7: First pass at drawing a lot of weights

But it’s pretty confusing. There are a lot of lines. So I did two related things. I set the line thickness to be a function of the target neuron and the color of the line to be a function of the source neuron. Using the same scheme, I built a legend to indicate the source and target neurons that identify each weight, using the coordinates of the matrix – basically treating it as an adjacency matrix.

# the weights were flattened row by row, so flat index i = source * num_targets + target
num_targets = self.target.num_neurons
names = []
for i, row in enumerate(np_mat.T):
    src_n = i // num_targets
    targ_n = i % num_targets
    # color identifies the source neuron, line width the target neuron
    plt.plot(row, linewidth=targ_n + 1, color="C{}".format(src_n))
    names.append("{} s:t[{}:{}]".format(var_name, src_n, targ_n))
plt.legend(names)
plt.title(title)

And that gives a chart that lets us examine what’s going on. All the blue lines are the weights that adjust the value coming from the first source neuron, distributed over target neurons [0, 1, 2, 3]. All the thin lines are the weights that set the value of the first target neuron from source neurons [0, 1, 2]:

Figure 8: Weights between input and hidden layers
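Treating the weights as an adjacency matrix depends on mapping each flattened weight back to its (source, target) coordinates. The sketch below assumes the weights were flattened row by row, which is the order np.nditer gives for a freshly created array. The sizes and names are illustrative.

```python
import numpy as np

num_source, num_target = 3, 4

# A stand-in weight matrix whose values are just their own flat index
weights = np.arange(num_source * num_target).reshape(num_source, num_target)

# Flatten the same way nparray_to_list() does
flat = [int(x) for x in np.nditer(weights)]

# Recover (source, target) from the flat index
coords = [divmod(i, num_target) for i in range(len(flat))]

for i, (src_n, targ_n) in enumerate(coords):
    # the value at the recovered coordinates matches the flattened value
    assert weights[src_n][targ_n] == flat[i]

print(coords[:5])
```

The source index comes from integer division by the number of target neurons, and the target index from the remainder.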

We now have a way to visualize the whole process inside the layers. Let’s see if we can learn anything by looking at how the neurons and weights coevolve over time.

Some final thoughts

I think the fundamental lesson here is one of gradient descent (or hill climbing if you prefer) from a random initial state to a stable set of values that will set the variables in a function. Once those values are found, the function can do its job, which in this case is taking a set of observations – ([ 1, 0, 1 ], [ 0, 1, 1 ], [ 0, 0, 1 ], [ 1, 1, 1 ]) and transforming them to a different set of values – ([1], [1], [0], [0]).
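For readers who want the whole descent in one place, here is a compact sketch of the same streetlight problem without the layer objects. It reuses the observations, targets, alpha and activation functions from the post. The seed, the fixed count of 60 passes and the variable names are arbitrary choices for this sketch, not the settings in simple_nn.py.

```python
import numpy as np

def relu(x):
    return (x > 0) * x

def relu2deriv(output):
    return 1.0 * (output > 0)

np.random.seed(1)  # arbitrary seed for this sketch
alpha = 0.2

streetlights = np.array([[1, 0, 1], [0, 1, 1], [0, 0, 1], [1, 1, 1]])
walk_vs_stop = np.array([[1], [1], [0], [0]])

w_in_hidden = 2 * np.random.random((3, 4)) - 1
w_hidden_out = 2 * np.random.random((4, 1)) - 1

errors = []
for iteration in range(60):
    total = 0.0
    for i in range(len(streetlights)):
        # train: ripple the sample down through the layers
        input_row = streetlights[i:i + 1]
        hidden_row = relu(np.dot(input_row, w_in_hidden))
        output_row = np.dot(hidden_row, w_hidden_out)

        # delta: how far the output is from the goal
        output_delta = walk_vs_stop[i:i + 1] - output_row
        total += float(np.sum(output_delta ** 2))

        # learn: push the delta back up and nudge the weights
        hidden_delta = np.dot(output_delta, w_hidden_out.T) * relu2deriv(hidden_row)
        w_hidden_out += alpha * np.dot(hidden_row.T, output_delta)
        w_in_hidden += alpha * np.dot(input_row.T, hidden_delta)
    errors.append(total)

print("first pass error: {:.5f}".format(errors[0]))
print("last pass error:  {:.5f}".format(errors[-1]))
```

Comparing the first and last entries of errors gives a rough sense of how far the weights have settled.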

Figure 9: Weights influencing neurons

This is at its core stochastic, a mechanism for harnessing randomness by using rules. The weights and neurons exist in a constrained, multidimensional space. Much of this is fixed before a single iteration – the number of neurons and how they are arranged. The types of connections (activation and derivative functions). The initial value of the weights. Even the manner of input and the “fitness test” that determines the error that is measured. Within these constraints, the weights move slowly under multiple influences until they settle into places where they are no longer forced to move. That’s it.

Variations in this system can be used for all kinds of things, ranging from image recognition to generating words, but the basic process is always the same.

I hope this helped you to read as much as it helped me to write!

Full code listings

For the most current versions, please use the GitHub repo, but these are up to date as of January 10, 2019

simple_nn.py

'''
Copyright 2019 Philip Feldman

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and 
associated documentation files (the "Software"), to deal in the Software without restriction, including 
without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 
copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the 
following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial 
portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT 
LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO 
EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR 
THE USE OR OTHER DEALINGS IN THE SOFTWARE.
'''

# based on https://github.com/iamtrask/Grokking-Deep-Learning/blob/master/Chapter6%20-%20Intro%20to%20Backpropagation%20-%20Building%20Your%20First%20DEEP%20Neural%20Network.ipynb
import numpy as np
import matplotlib.pyplot as plt
import src.SimpleLayer as sl

# Methods ---------------------------------------------
# activation function: sets all negative numbers to zero
# Otherwise returns x
def relu(x: np.array) -> np.array :
    return (x > 0) * x

# This is the derivative of the above relu function, since the derivative of 1x is 1
def relu2deriv(output: np.array) -> np.array:
    return 1.0*(output > 0) # returns 1 for input > 0
    # return 0 otherwise

# create a layer
def create_layer(layer_name: str, neuron_count: int, target: sl.SimpleLayer = None) -> 'SimpleLayer':
    layer = sl.SimpleLayer(layer_name, neuron_count, relu, relu2deriv, target)
    layer_array.append(layer)
    return layer

# variables ------------------------------------------
np.random.seed(0)
alpha = 0.2
# the samples. Columns are the things we're sampling, rows are the samples
streetlights_array = np.array( [[ 1, 0, 1 ],
                                [ 0, 1, 1 ],
                                [ 0, 0, 1 ],
                                [ 1, 1, 1 ]])
num_streetlights = len(streetlights_array[0])
num_samples = len(streetlights_array)

# The data set we want to map to. Each entry in the array matches the corresponding streetlights_array row
walk_vs_stop_array = np.array([[1],
                               [1],
                               [0],
                               [0]])

# set up the list that will store the layers
layer_array = []

error_plot_mat = [] # for drawing plots

#set up the layers from last to first, so that there is a target layer
output = create_layer("output", 1)
output.set_neurons([1])
''' # If we want to have four layers (two hidden), use this and comment out the other hidden code below
hidden2 = create_layer("hidden2", 2, output)
hidden2.set_neurons([1, 2])
hidden = create_layer("hidden", 4, hidden2)
hidden.set_neurons([1, 2, 3, 4])
'''
# If we want to have three layers (one hidden), use this and comment out the other hidden code above
hidden = create_layer("hidden", 4, output)
hidden.set_neurons([1, 2, 3, 4])

input = create_layer("input", 3, hidden)
input.set_neurons([1, 2, 3])

for layer in reversed(layer_array):
    print ("--------------")
    print (layer.to_string())

iter = 0
max_iter = 1000
epsilon = 0.001
error = 2 * epsilon
while error > epsilon:
    error = 0
    sample_error_array = []
    for sample_index in range(num_samples):
        input.set_neurons(streetlights_array[sample_index])
        for layer in reversed(layer_array):
            layer.train()

        delta = output.calc_delta(walk_vs_stop_array[sample_index])
        sample_error = np.sum(delta ** 2)
        error += sample_error

        for layer in layer_array:
            layer.learn(alpha)

        # Gather data for the plots
        sample_error_array.append(sample_error)
        # print("{}.{} Error = {:.5f}".format(iter, sample_index, sample_error))

    error /= num_samples
    sample_error_array.append(error)
    error_plot_mat.append(sample_error_array)
    if (iter % 10) == 0 :
        print("{} Error = {:.5f}".format(iter, error))
    iter += 1
    # stop even if we don't converge
    if iter > max_iter:
        break

print("\n--------------evaluation")
for sample_index in range(len(streetlights_array)):
    input.set_neurons(streetlights_array[sample_index])
    for layer in reversed(layer_array):
        layer.train()
    # index down to the single element before converting to a Python float
    prediction = float(output.neuron_row_array[0][0])
    observed = float(walk_vs_stop_array[sample_index][0])
    accuracy = 1.0 - abs(prediction - observed)
    print("sample {} - input: {} = pred: {:.3f} vs. actual:{} ({:.2f}% accuracy)".
          format(sample_index, input.neuron_row_array, prediction, observed, accuracy*100.0))

# plots ----------------------------------------------
fig_num = 1
f1 = plt.figure(fig_num)
plt.plot(error_plot_mat)
plt.title("error")
names = []
for i in range(num_samples):
    names.append("sample_{}".format(i))
names.append("average")
plt.legend(names)

for layer in reversed(layer_array):
    if layer.target != None:
        fig_num += 1
        layer.plot_weight_matrix(fig_num)

for layer in reversed(layer_array):
    fig_num += 1
    layer.plot_neuron_matrix(fig_num)

plt.show()

SimpleLayer.py

'''
Copyright 2019 Philip Feldman

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and 
associated documentation files (the "Software"), to deal in the Software without restriction, including 
without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 
copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the 
following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial 
portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT 
LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO 
EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR 
THE USE OR OTHER DEALINGS IN THE SOFTWARE.
'''
# based on https://github.com/iamtrask/Grokking-Deep-Learning/blob/master/Chapter6%20-%20Intro%20to%20Backpropagation%20-%20Building%20Your%20First%20DEEP%20Neural%20Network.ipynb
import numpy as np
import matplotlib.pyplot as plt
import types
import typing

# methods --------------------------------------------
class SimpleLayer:
    name = "unset"
    neuron_row_array = None
    neuron_col_array = None
    weight_row_mat = None
    weight_col_mat = None
    weight_history_mat = [] # for drawing plots
    neuron_history_mat = []
    num_neurons = 0
    delta = 0 # the amount to move the source layer
    target = None
    source = None
    activation_func = None
    derivative_func = None

    # set up the layer with the number of neurons, the next layer in the sequence, and the activation/backprop functions
    def __init__(self, name, num_neurons: int, activation_ptr: types.FunctionType, deriv_ptr: types.FunctionType, target: 'SimpleLayer' = None):
        self.reset()
        self.activation_func = activation_ptr
        self.derivative_func = deriv_ptr
        self.name = name
        self.num_neurons = num_neurons
        self.neuron_row_array = np.zeros((1, num_neurons))
        self.neuron_col_array = np.zeros((num_neurons, 1))
        # We only have weights if there is another layer below us
        for i in range(num_neurons):
            self.neuron_history_mat.append([])
        if(target != None):
            self.target = target
            target.source = self
            self.weight_row_mat = 2 * np.random.random((num_neurons, target.num_neurons)) - 1
            self.weight_col_mat = self.weight_row_mat.T

    def reset(self):
        self.name = "unset"
        self.neuron_row_array = None
        self.neuron_col_array = None
        self.weight_row_mat = None
        self.weight_col_mat = None
        self.weight_history_mat = [] # for drawing plots
        self.neuron_history_mat = []
        self.num_neurons = 0
        self.delta = 0 # the amount to move the source layer
        self.target = None
        self.source = None

    # Fill neurons with values
    def set_neurons(self, val_list: typing.List):
        # print("cur = {}, input = {}".format(self.neuron_array, val_list))
        for i in range(0, len(val_list)):
            self.neuron_row_array[0][i] = val_list[i]
        self.neuron_col_array = self.neuron_row_array.T

    def get_plot_mat(self) -> typing.List:
        return self.weight_history_mat

    # In training, the basic goal is to set a value for the layer's neurons, based on the weights in the source layer mediated by an activation function.
    # This matrix is simply the source neurons times this layer's neurons. For example, if the source layer had three neurons and this layer had four, then
    # the (source) weight matrix would be 3*4 = 12 weights.
    def train(self):
        # if not the bottom layer, we can record values for plotting
        if(self.target != None):
            self.weight_history_mat.append(self.nparray_to_list(self.weight_row_mat))

        # if we're not the top layer, propagate weights
        if self.source != None:
            src = self.source
            # set our neuron values as the dot product of the source neurons, and the source weights
            self.neuron_row_array = np.dot(src.neuron_row_array, src.weight_row_mat)

            # No activation function to output layer
            if(self.target != None):
                # Adjust the values based on the activation function. This introduces nonlinearity.
                # For example, the relu function clamps all negative values to zero
                self.neuron_row_array = self.activation_func(self.neuron_row_array)

            # Transpose the neuron array and save for learn()
            self.neuron_col_array = self.neuron_row_array.T

        # record values for plotting
        for i in range(self.num_neurons):
            self.neuron_history_mat[i].append(self.neuron_row_array[0][i])


    # In learning, the basic goal is to adjust the weights that set this layer's neurons (in this implementation, the source layer). This is done
    # by backpropagating the error delta from this layer to the source layer. Since we only want to adjust the weights that participated in the
    # training, we need to take the derivative of the activation function in train(). Again, the weight matrix is simply the source neurons times
    # this layer's neurons. For example, if the source layer had three neurons and this layer had four, then the (source) weight matrix would be 3*4 = 12 weights.
    def learn(self, alpha):
        # if there is a layer above us
        if self.source != None:
            src = self.source

            # calculate the error delta scalar array, which is the amount this layer needs to change,
            # multiplied across the weights used to set this layer (in the source)
            delta_scalar = np.dot(self.delta, src.weight_col_mat)

            # determine the backpropagation distribution. In the case of Relu, it's just one or zero
            delta_threshold = self.derivative_func(src.neuron_row_array)

            # set the amount the source layer needs to change, based on this layer's delta distributed over the source
            # neurons
            src.delta = delta_scalar * delta_threshold

            # create the weight adjustment matrix by taking the dot product of the source layer's neurons (as columns) and the
            # scaled, thresholded  row of deltas based on this layer's error delta and the source's weight layer
            mat = np.dot(src.neuron_col_array, self.delta)

            # add some percentage of the weight adjustment matrix to the source weight matrix
            src.weight_row_mat += alpha * mat
            src.weight_col_mat = src.weight_row_mat.T

    # given one or more goals (that match the number of neurons in this layer), determine the delta that, when added to the
    # neurons, would reach that goal
    def calc_delta(self, goal: np.array) -> np.ndarray:
        self.delta = goal - self.neuron_row_array
        return self.delta

    # helper function to turn a NumPy array to a Python list
    def nparray_to_list(self, vals: np.array) -> typing.List[float]:
        data = []
        for x in np.nditer(vals):
            data.append(float(x))
        return data

    def to_string(self):
        target_name = "no target"
        source_name = "no source"
        if self.target != None:
            target_name = self.target.name
        if self.source != None:
            source_name = self.source.name
        return "layer {}: \ntarget = {}\nsource = {}\nneurons (row) = {}\nweights (row) = \n{}".format(self.name, target_name, source_name, self.neuron_row_array, self.weight_row_mat)

    # create a line chart of the plot matrix that we've been building
    def plot_weight_matrix(self, fig_num: int):
        var_name = "weight"
        title = "{} to {} {}".format(self.name, self.target.name, var_name)
        plt.figure(fig_num)
        np_mat = np.array(self.weight_history_mat)

        # the weights were flattened row by row, so flat index i = source * num_targets + target
        num_targets = self.target.num_neurons
        names = []
        for i, row in enumerate(np_mat.T):
            src_n = i // num_targets
            targ_n = i % num_targets
            # color identifies the source neuron, line width the target neuron
            plt.plot(row, linewidth=targ_n + 1, color="C{}".format(src_n))
            names.append("{} s:t[{}:{}]".format(var_name, src_n, targ_n))
        plt.legend(names)
        plt.title(title)

    def plot_neuron_matrix(self, fig_num: int):
        title = "{} neuron history".format(self.name)
        plt.figure(fig_num)
        np_mat = np.array(self.neuron_history_mat)

        # markers only, no connecting lines
        plt.plot(np_mat.T, 'o', ms=2)

        names = []
        for i in range(self.num_neurons):
            names.append("neuron {}".format(i))
        plt.legend(names)
        plt.title(title)

Normal Accidents: Living with High-Risk Technologies

Normal Accidents: Living with High-Risk Technologies (1999 ed)

Author

Charles Perrow (Scholar search): An organizational theorist, he is the author of The Radical Attack on Business, Organizational Analysis: A Sociological View, Complex Organizations: A Critical Essay, and Normal Accidents: Living with High Risk Technologies. His interests include the development of bureaucracy in the 19th Century; the radical movements of the 1960s; Marxian theories of industrialization and of contemporary crises; accidents in such high risk systems as nuclear plants, air transport, DNA research and chemical plants; protecting the nation’s critical infrastructure; the prospects for democratic work organizations; and the origins of U.S. capitalism.

Overview

This book describes a type of catastrophic accident that emerges in particular types of complex systems. Perrow calls these accidents “normal” because they appear to be inevitable, yet unpredictable as to the specifics.

Normal accidents occur in systems that have three common characteristics:

  1. They are tightly coupled. Behavior in one part rapidly affects other parts.
  2. They are densely connected. One part may affect many other parts.
  3. Their processes are opaque: they cannot be observed directly, so behavior must be inferred.

In addition, systems that involve the transformation of their input, such as nuclear power, petrochemical plants, and DNA manipulation, are more likely to be tightly coupled, densely connected, and opaque, and as such are more prone to catastrophe.

A sub-theme is that these systems are always “accidents waiting to happen”, so an obvious cause can always be found in a post-mortem. However, systems that were inspected months before an accident are often found to be in good working order. The expectation of finding a problem after a catastrophe makes finding one more likely. It follows that looking at a running system through the lens of an imaginary catastrophe should be an effective way to tease out potential problems.

My more theoretical thoughts

As in Meltdown, which was based in part on this book, the systems described here look like networks characterized by the degree and stiffness of their connections, combined with an overall velocity of the networked system through belief space (described here as production pressure). That implies that modelling them using something like graph Laplacians might make sense, though the equations that describe the “weight” of the nodes, the “stiffness” of the edges, and the inertial characteristics of the system as a whole are unclear.
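
A minimal sketch of the graph Laplacian idea, under assumptions that are mine and not Perrow's: components are nodes, edge weights stand in for coupling “stiffness”, and a disturbance spreads by simple diffusion, dx/dt = -Lx. The four-node ring, the weights, and the function names are all hypothetical. It only illustrates that the same disturbance reaches distant nodes sooner when the edges are stiffer.

```python
import numpy as np


def graph_laplacian(weights: np.ndarray) -> np.ndarray:
    """Return L = D - W for a symmetric, non-negative weight matrix W."""
    degree = np.diag(weights.sum(axis=1))
    return degree - weights


def diffuse(weights: np.ndarray, x0: np.ndarray, dt: float, steps: int) -> np.ndarray:
    """Explicit Euler integration of dx/dt = -L x, starting from x0."""
    lap = graph_laplacian(weights)
    x = x0.astype(float)
    for _ in range(steps):
        x = x - dt * (lap @ x)
    return x


# Hypothetical 4-node ring; every edge has the same stiffness.
ring = np.array([[0, 1, 0, 1],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [1, 0, 1, 0]], dtype=float)

x0 = np.array([1.0, 0.0, 0.0, 0.0])  # the disturbance starts at node 0

loose = diffuse(0.1 * ring, x0, dt=0.1, steps=10)  # soft edges
tight = diffuse(1.0 * ring, x0, dt=0.1, steps=10)  # stiff edges

print("loose:", np.round(loose, 3))
print("tight:", np.round(tight, 3))
```

After the same number of steps, node 2 (the node farthest from the disturbance) has received more of it in the stiff network than in the soft one. Node “weight” and system inertia are left out entirely, since it is unclear how they should enter.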

Notes

Introduction

  • Rather, I will dwell upon characteristics of high-risk technologies that suggest that no matter how effective conventional safety devices are, there is a form of accident that is inevitable. (Page 3)
  • Most high-risk systems have some special characteristics, beyond their toxic or explosive or genetic dangers, that make accidents in them inevitable, even “normal.” This has to do with the way failures can interact and the way the system is tied together. It is possible to analyze these special characteristics and in doing so gain a much better understanding of why accidents occur in these systems, and why they always will. If we know that, then we are in a better position to argue that certain technologies should be abandoned, and others, which we cannot abandon because we have built much of our society around them, should be modified. (Page 4)
  • No one dreamed that when X failed, Y would also be out of order and the two failures would interact so as to both start a fire and silence the fire alarm. (Page 4)
  • This interacting tendency is a characteristic of a system, not of a part or an operator; we will call it the “interactive complexity” of the system. (Page 4)
  • For some systems that have this kind of complexity, such as universities or research and development labs, the accident will not spread and be serious because there is a lot of slack available, and time to spare, and other ways to get things done. But suppose the system is also “tightly coupled,” that is, processes happen very fast and can’t be turned off, the failed parts cannot be isolated from other parts, or there is no other way to keep the production going safely. Then recovery from the initial disturbance is not possible; it will spread quickly and irretrievably for at least some time. Indeed, operator action or the safety systems may make it worse, since for a time it is not known what the problem really is. (Page 5)
  • these systems require organizational structures that have large internal contradictions, and technological fixes that only increase interactive complexity and tighten the coupling; they become still more prone to certain kinds of accidents. (Page 5)
  • If interactive complexity and tight coupling—system characteristics—inevitably will produce an accident, I believe we are justified in calling it a normal accident, or a system accident. The odd term normal accident is meant to signal that, given the system characteristics, multiple and unexpected interactions of failures are inevitable. (Page 5)
  • The cause of the accident is to be found in the complexity of the system. That is, each of the failures—design, equipment, operators, procedures, or environment—was trivial by itself. Such failures are expected to occur since nothing is perfect, and we normally take little notice of them. (Page 7)
  • Most of the time we don’t notice the inherent coupling in our world, because most of the time there are no failures, or the failures that occur do not interact. But all of a sudden, things that we did not realize could be linked (buses and generators, coffee and a loaned key) became linked. The system is suddenly more tightly coupled than we had realized. (Page 8)
    • Comment: This is the central point in the stampeding of things like self-driving cars. The system is more tightly coupled than we realize.
  • It is normal in the sense that it is an inherent property of the system to occasionally experience this interaction. Three Mile Island was such a normal or system accident, and so were countless others that we shall examine in this book. We have such accidents because we have built an industrial society that has some parts, like industrial plants or military adventures, that have highly interactive and tightly coupled units. Unfortunately, some of these have high potential for catastrophic accidents. (Page 8)
  • This dependence is known as tight coupling. On the other hand, events in a system can occur independently as we noted with the failure of the generator and forgetting the keys. These are loosely coupled events, because although at this time they were both involved in the same production sequence, one was not caused by the other. (Page 8)
  • In complex industrial, space, and military systems, the normal accident generally (not always) means that the interactions are not only unexpected, but are incomprehensible for some critical period of time. (Page 8)
  • But if, as we shall see time and time again, the operator is confronted by unexpected and usually mysterious interactions among failures, saying that he or she should have zigged instead of zagged is possible only after the fact. Before the accident no one could know what was going on and what should have been done. Sometimes the errors are bizarre. We will encounter “noncollision course collisions,” for example, where ships that were about to pass in the night suddenly turn and ram each other. But careful inquiry suggests that the mariners had quite reasonable explanations for their actions; it is just that the interaction of small failures led them to construct quite erroneous worlds in their minds, and in this case these conflicting images led to collision. (Page 9)
  • Small beginnings all too often cause great events when the system uses a “transformation” process rather than an additive or fabricating one. Where chemical reactions, high temperature and pressure, or air, vapor, or water turbulence is involved, we cannot see what is going on or even, at times, understand the principles. In many transformation systems we generally know what works, but sometimes do not know why. These systems are particularly vulnerable to small failures that “propagate” unexpectedly, because of complexity and tight coupling. (Page 10)
    • Comment: This is a surprisingly apt description of a deep neural network and a good explanation of why they are risky.
  • High-risk systems have a double penalty: because normal accidents stem from the mysterious interaction of failures, those closest to the system, the operators, have to be able to take independent and sometimes quite creative action. But because these systems are so tightly coupled, control of operators must be centralized because there is little time to check everything out and be aware of what another part of the system is doing. An operator can’t just do her own thing; tight coupling means tightly prescribed steps and invariant sequences that cannot be changed. But systems cannot be both decentralized and centralized at the same time; (Page 10)
    • Comment: And the pressures to make systems efficient lead to tighter coupling and regard diversity as a penalty, which it almost always is in the short term. Diversity prevents catastrophes, but inhibits rewards/profits.
  • when we move away from the individual dam or mine and take into account the larger system in which they exist, we find the “eco-system accident,” an interaction of systems that were thought to be independent but are not because of the larger ecology. (Page 14)
    • Comment: I think financial crashes may also be manifestations of an eco-system accident. Credit-default swaps are similar to DDT in that they concentrate risk and move it up the food chain.
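
The slack argument quoted above (Page 5) can be sketched as a toy model, assuming a disturbance is handed down a chain of stages and each stage can absorb a fixed amount of it before passing the rest on. The chain model, the function name, and the numbers are hypothetical illustrations, not anything from the book.

```python
from typing import List


def propagate(disturbance: float, slack_per_stage: float, num_stages: int) -> List[float]:
    """Return the size of the disturbance arriving at each stage of a chain.

    Each stage absorbs up to slack_per_stage and passes the remainder
    downstream. Zero slack stands in for tight coupling.
    """
    arriving = []
    remaining = disturbance
    for _ in range(num_stages):
        arriving.append(remaining)
        remaining = max(0.0, remaining - slack_per_stage)
    return arriving


loose = propagate(disturbance=1.0, slack_per_stage=0.5, num_stages=5)
tight = propagate(disturbance=1.0, slack_per_stage=0.0, num_stages=5)

print("with slack:   ", loose)
print("without slack:", tight)
```

With slack, the disturbance dies out after a couple of stages. Without it, every stage sees the full disturbance, which is the “spread quickly and irretrievably” case.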

Chapter 1: Normal Accident at Three Mile Island

  • We are now, incredibly enough, only thirteen seconds into the “transient,” as engineers call it. (It is not a perversely optimistic term meaning something quite temporary or transient, but rather it means a rapid change in some parameter, in this case, temperature.) In these few seconds there was a false signal causing the condensate pumps to fail, two valves for emergency cooling out of position and the indicator obscured, a PORV that failed to reseat, and a failed indicator of its position. The operators could have been aware of none of these. (Page 21)
    • Comment: This could be equivalent to a navigation system routing a fleet of cars through a recent disaster, where the road is lethal, but the information about that has not entered the system. (examples)
  • What they didn’t know, and couldn’t know, was that with the PORV open and the two feedwater valves blocked, preventing the removal of residual heat, they already had a LOCA, but not from a pipe break. The rise in pressure in the pressurizer was probably due to the steam voids rapidly forming because the core was close to becoming uncovered. They thought they were avoiding a LOCA when they were in one and were making it worse. With the PORV stuck open, the danger of going solid in the pressurizer was reduced because the open valve would provide some relief. But no one knew it was open. (Page 26)
  • We will encounter this man’s dilemma a few more times in this book; it goes to the core of a common organizational problem. In the face of uncertainty, we must, of course, make a judgment, even if only a tentative and temporary one. Making a judgment means we create a “mental model” or an expected universe. (Page 27)
    • Comment: This is belief space, and these models are often created as stories with sequences. Having a headache can create a story of something temporary that will be fixed with a pill, or something serious that requires a trip to the hospital. The type of story that gets told depends very much on the location in the belief space the user is in. Someone who is worried about a stroke is much more likely to think that a headache is serious and build a narrative around that.
  • Despite the fact that this is no proper test of the appropriateness of alternative B rather than A, it serves to “confirm” your decision. In so believing, you are actually creating a world that is congruent with your interpretation, even though it may be the wrong world. It may be too late before you find that out. (Page 27)
  • These are not expected sequences in a production or safety system; they are multiple failures that interacted in an incomprehensible manner (Page 31)
    • Comment: This could be incomprehensible for any intelligent system, not just human.

Chapter 2: Nuclear Power as a High-Risk System: Why We Have Not Had More TMIs–But Will Soon

  • Shoddy construction and inadvertent errors, intimidation and actual deception—these are part and parcel of industrial life. No industry is without these problems, just as no valve can be made failure-proof. Normally, the consequences are not catastrophic. They may be, however, if you build systems with catastrophic potential. (Page 37)
  • A bit more revealing is another discussion of seven “criticality” accidents. If plutonium, which is exceedingly volatile and hard to machine or handle, experiences the proper conditions, it can attain a self-sustaining fission chain reaction. Criticality depends upon the quantity of the plutonium, the size, shape, and material of the vessel that holds it, the nature of any solvents or dilutants, and even adjacent material, which may reflect neutrons back into the plutonium. It is apparently hard to know when these conditions might be just right. (Page 55)

Chapter 3: Complexity, Coupling, and Catastrophe

  • But there are also degrees of disturbance to systems. The rally will not be disturbed in any perceptible way if I show up with a scratch on my ’57 De Soto, or if, ashamed of the state of my car, I do not go at all. But my system might be greatly disturbed if I stayed home rather than meeting or impressing people. The degree of disturbance, then, is related to what we define as the system. If the rally is the system under analysis, there is no accident. If that part of my life concerned with custom cars is the system, there is an accident. A steam generator tube failure in a nuclear plant can hardly be anything other than an accident for that plant and for that utility. Yet it may or may not have an appreciable effect upon the nuclear power “system” in the United States. (Page 64)
  • Most of the work concerned with safety and accidents deals, rightly enough, with what I call first-party victims, and to some extent second-party victims. But in this book we are concerned with third- and fourth-party victims. Briefly, first-party victims are the operators; second-party victims are nonoperating personnel or system users such as passengers on a ship; third-party victims are innocent bystanders; fourth-party victims are fetuses and future generations. Generally, as we move from operators to future generations, the number of persons involved rises geometrically, risky activities are less well compensated, and the risks taken are increasingly unknown ones. (Page 67)
  • The following list presents these and adds the definition of the two types of accidents: component failure accidents and system accidents, which we will now take up. (Page 70)
    • Systems are divided into four levels of increasing aggregation: units, parts, subsystems, and system.
    • Incidents involved damage to or failures of parts or a unit only, even though the failure may stop the output of the system or affect it to the extent that it must be stopped.
    • Accidents involve damage to subsystems or the system as a whole, stopping the intended output or affecting it to the extent that it must be halted promptly.
    • Component failure accidents involve one or more component failures (part, unit, or subsystem) that are linked in an anticipated sequence.
    • System accidents involve the unanticipated interaction of multiple failures.
  • A system accident, in our definition, must have multiple failures, and they are likely to be in reasonably independent units or subsystems. But system accidents, as with all accidents, start with a component failure, most commonly the failure of a part, say a valve or an operator error. It is not the source of the accident that distinguishes the two types, since both start with component failures; it is the presence or not of multiple failures that interact in unanticipated ways. (Page 71)
  • Incidents are overwhelmingly the most common untoward system events. Accidents are far less frequent. Among accidents, component failure accidents are far more frequent than system accidents. I have no reliable way to estimate these frequencies. For the systems analyzed in this book the richest body of data comes from the safety-related failures that nuclear plants in the United States are required to report. Roughly 3,000 Licensee Event Reports are filed each year by the 70 or so plants. Based upon the literature discussing these reports, I estimate 300 of the 3,000 events might be called accidents; 15 to 30 of these might be system accidents. (Page 71)
    • Comment: This looks like a power-law relationship. If true, then it would be a nice way of estimating the frequencies of rarer types of accidents based on the more common types.
  • The notion of baffling interactions is increasingly familiar to all of us. It characterizes our social and political world as well as our technological and industrial world. As systems grow in size and in the number of diverse functions they serve, and are built to function in ever more hostile environments, increasing their ties to other systems, they experience more and more incomprehensible or unexpected interactions. They become more vulnerable to unavoidable system accidents. (Page 72)
    • Comment: The central premise of Stampede Theory: This emergent property of tightly coupled systems is not specific to a technology.
  • These are “linear” interactions: production is carried out through a series or sequence of steps laid out in a line. It doesn’t matter much whether there are 1,000 or 1,000,000 parts in the line. It is easy to spot the failure and we know what its effect will be on the adjacent stations. There will be product accumulating upstream and incomplete product going out downstream of the failure point. Most of our planned life is organized that way. (Page 72)
    • Comment: A linear system is a list of transformations. I think that is also the general form of a story, but not a game or a map. This means that linear systems in physical or belief space are inherently understandable. The trick in belief space is to determine what is important at what scale to achieve meaningful understanding.
  • But what if parts, or units, or subsystems (that is, components) serve multiple functions? For example, a heater might both heat the gas in tank A and also be used as a heat exchanger to absorb excess heat from a chemical reactor. If the heater fails, tank A will be too cool for the recombination of gas molecules expected, and at the same time, the chemical reactor will overheat as the excess heat fails to be absorbed. This is a good design for a heater, because it saves energy. But the interactions are no longer linear. The heater has what engineers call a “common-mode” function—it services two other components, and if it fails, both of those “modes” (heating the tank, cooling the reactor) fail. This begins to get more complex. (Page 73)
  • I will refer to these kinds of interactions as complex interactions, suggesting that there are branching paths, feedback loops, jumps from one linear sequence to another because of proximity and certain other features we will explore shortly. The connections are not only adjacent, serial ones, but can multiply as other parts or units or subsystems are reached. (Page 75)
    • Comment: Modeled by graph Laplacians
  • To summarize our work so far: Linear interactions are those interactions of one component in the DEPOSE system (Design, Equipment, Procedures, Operators, Supplies and materials, and Environment) with one or more components that precede or follow it immediately in the sequence of production. Complex interactions are those in which one component can interact with one or more other components outside of the normal production sequence, either by design or not by design. (Page 77)
  • These problems exist in all industrial and transportation systems, but they are greatly magnified in systems with many complex interactions. This is because interactions, caused by proximity, common mode connections, or unfamiliar or unintended feedback loops, require many more probes of system conditions, and many more alterations of the conditions. Much more is simply invisible to the controller. The events go on inside vessels, or inside airplane wings, or in the spacecraft’s service module, or inside computers. Complex systems tend to have elaborate control centers not because they make life easier for the operators, saving steps or time, nor because there is necessarily more machinery to control, but because components must interact in more than linear, sequential ways, and therefore may interact in unexpected ways. (Page 82)
  • In complex systems, where not even a tip of an iceberg is visible, the communication must be exact, the dial correct, the switch position obvious, the reading direct and “on-line.” (Page 84)
    • Comment: True for any stiff, dense system
  • The problem of indirect or inferential information sources is compounded by the lack of redundancy available to complex systems. If we stopped to notice, we would observe that our daily life is full of missed or misunderstood signals and faulty information. A great deal of our speech is devoted to redundancy—saying the same thing over and over, or repeating it in a slightly different way. We know from experience that the person we are talking to may be in a different cognitive framework, framing our remarks to “hear” that which he expects to hear, not what he is being told. The listener suppresses such words as “not” or “no” because he doesn’t expect to hear them. Indeed, he does not “hear” them, in the literal sense of processing in his brain the sounds that enter the ear. All sorts of trivial misunderstandings, and some quite serious ones, occur in normal conversation. We should not be surprised, then, if ambiguous or indirect information sources in complex systems are subject to misinterpretation. (Page 84)
  • accidents continue to plague transformation processes that are fifty years old. These are processes that can be described, but not really understood. They were often discovered through trial and error, and what passes for understanding is really only a description of something that works. (Page 85)
    • Comment: This is also true with machine learning. We don’t know why these algorithms are so effective and what their limits are. Or even their points of greatest sensitivity.
  • To summarize, complex systems are characterized by (Page 86):
    • Proximity of parts or units that are not in a production sequence;
    • many common mode connections between components (parts, units, or subsystems) not in a production sequence;
    • unfamiliar or unintended feedback loops;
    • many control parameters with potential interactions;
    • indirect or inferential information sources; and
    • limited understanding of some processes.
  • Linear systems lack the common-mode connections that require proximity. It is also a design criterion to separate various stages of production for sheer ease of maintenance access or replacement of equipment. Linear systems not only have spatial segregation of separate phases of production, but within production sequences the links are few and sequential, allowing damaged components to be pulled out with minimal disturbance to the rest of the system. (Page 86)
  • Though I don’t want to claim a vast difference between employees in complex and linear systems, the latter appear to have fewer specialized and esoteric skills, allowing more awareness of interdependencies if they appear. The welder in a nuclear plant is more specialized (and specially rated), and presumably more isolated from other personnel, than the welder in a fabrication plant. Specialized personnel tend not to bridge the wide range of possible interactions; generalists, rather than specialists, are perhaps more likely to see unexpected connections and cope with them. (Page 87)
    • Comment: Linear systems are lower dimensional. In fact, what is called complex here is sometimes just high dimensional and at other times truly complex. Feedback loops are a sign of true complexity.
  • Complex-Linear (Page 88)
  • Loosely coupled systems, whether for good or ill, can incorporate shocks and failures and pressures for change without destabilization. Tightly coupled systems will respond more quickly to these perturbations, but the response may be disastrous. Both types of systems have their virtues and vices. (Page 92)
  • Loosely coupled systems are said to have “equifinality”—many ways to skin the cat; tightly coupled ones have “unifinality.” (Page 94)
    • Comment: A tightly coupled system can only behave as a single individual, made up of tightly connected parts (a body). Loosely coupled systems behave more like populations. In humans (most animals?), the need for novelty provides velocity to the members of the population and is also a driver for coordination among systems that would otherwise be much more loosely coupled?
  • Tightly coupled systems have little slack. Quantities must be precise; resources cannot be substituted for one another; wasted supplies may overload the process; failed equipment entails a shutdown because the temporary substitution of other equipment is not possible. No organization makes a virtue out of wasting supplies or equipment, but some can do so without bringing the system down or damaging it. In loosely coupled systems, supplies and equipment and human power can be wasted without great cost to the system. Something can be done twice if it is not correct the first time; one can temporarily get by with lower quality in supplies or products in the production line. The lower quality goods may have to be rejected in the end, but the technical system is not damaged in the meantime. (Page 94)
  • In tightly coupled systems the buffers and redundancies and substitutions must be designed in; they must be thought of in advance. In loosely coupled systems there is a better chance that expedient, spur-of-the-moment buffers and redundancies and substitutions can be found, even though they were not planned ahead of time. (Page 95)
    • Comment: This is why diversity matters: it loosens the coupling. The question is to determine a) how much friction is needed for a given system, and b) what the best way is to inject diversity or other forms of friction.
  • Tightly coupled systems are not completely devoid of unplanned safety devices. In two of the most famous nuclear plant accidents, Browns Ferry and TMI, imaginative jury-rigging was possible and operators were able to save the systems through fortuitous means. At TMI two pumps were put into service to keep the coolant circulating, even though neither was designed for core cooling. Subjected to intense radiation they were not designed to survive; one of them failed rather quickly, but the other kept going for days, until natural circulation could be established. Something more complex but similar took place at Browns Ferry. The industry claimed that the recovery proved that the safety features worked; but the designed-in ones did not work. (Page 95)
  • The placement of systems is based entirely on subjective judgments on my part; at present there is no reliable way to measure these two variables, interaction and coupling. (Page 96)
    • Comment: Graph Laplacians? The system is a network. That network can be tiny or huge. Characterizing the nodes and edges at scale is difficult.
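
The definitions and rough frequencies quoted in this chapter (Pages 70 and 71) can be sketched as follows. This is one reading of those definitions, not Perrow's own procedure: the `Level` enum, the `classify` function, and its arguments are hypothetical names, and a single unanticipated failure is treated here as a component failure accident, which the quoted text does not spell out.

```python
from enum import IntEnum


class Level(IntEnum):
    # The relative order of PART and UNIT is an assumption; only the split
    # between {PART, UNIT} and {SUBSYSTEM, SYSTEM} matters below.
    PART = 1
    UNIT = 2
    SUBSYSTEM = 3
    SYSTEM = 4


def classify(highest_level_damaged: Level, num_failures: int, anticipated: bool) -> str:
    """Label an event using the incident/accident definitions from Page 70."""
    if highest_level_damaged < Level.SUBSYSTEM:
        return "incident"  # damage limited to parts or a unit
    if num_failures > 1 and not anticipated:
        return "system accident"  # unanticipated interaction of multiple failures
    return "component failure accident"  # failures linked in an anticipated sequence


# Rough annual figures quoted from Page 71.
events = 3000
accidents = 300
system_accidents = (15, 30)  # low and high estimates

events_per_accident = events / accidents
accidents_per_system_accident = [accidents / n for n in system_accidents]

print(classify(Level.UNIT, 1, True))
print(classify(Level.SYSTEM, 3, False))
print(events_per_accident, accidents_per_system_accident)
```

Each tier is roughly 10 to 20 times rarer than the one before it, which is what the power-law hunch in the comment above rests on. Three data points are far too few to confirm it.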

Chapter 4: Petrochemical Plants

  • Our discussion of nuclear power plants should make the following observations familiar enough. There was organizational ineptitude: they were knowingly short of engineering talent, and the chief engineer had left; there was a hasty decision on the by-pass, a failure to get expert advice, and most probably, strong production pressures. But as was noted in the case of TMI and other nuclear plants, and as will be apparent from other chapters, this is the normal condition for organizations; we should congratulate ourselves when they manage to run close to expectations. Had the pipe held out a short time longer until the reactor was repaired, and a new chief engineer been hired, and had a governmental inquiry board then come around, they could have concluded it was a well-run, safe plant. Once there is an accident, one looks for and easily finds the great causes for the great event. There were unheeded warning signals—the by-pass pipe had been noted to move up and down slightly during operation, surely an irregular bit of behavior; and there were unexplained anomalies in pressure and temperature and hydrogen consumption. But these were warnings only in retrospect. In these complex systems, minor warnings are probably always available for recall once there is an occasion, but if we shut down for every little thing… (Page 112)
  • The report also faults the operators for not closely watching the indicators that monitored the flow of synthesis gas through the heater, which would have disclosed that something was wrong. But the report goes on to acknowledge that “the flow indicators were considered unreliable because there was hardly any indication of flow during both normal operations and in start-up conditions, especially when starting up both converters simultaneously,” as they were now doing. (Note this trivial interdependency of the two converters and their effect upon flow.) The operators, then, were blamed for not monitoring a flow that was so faint it could not be reliably measured. It hardly mattered in any case, since both flow indicators had been set incorrectly, and furthermore the alarms for them, indicating low flow, “had been disarmed since they caused nuisance problems during normal operations.”[26] Small wonder, after all this, that the flow indicators “were not closely monitored.” (Page 116)
  • High temperature alarms would occur, but the operators learned to ignore them. (Page 118)
  • After the paper was presented, a discussant noted it is not the newness of the plant that is the problem. Even in the older plants, he said, “We struggle to control it -Runaways will take place and control by these caps is not the answer…. The way it is now we are in difficulties and I don’t think anybody is sophisticated enough to operate the plant safely.”[33] The problems in this mature, but increasingly sophisticated and high volume industry, appear to lie in the nature of the highly interactive, very tightly coupled system itself, not in any design or equipment deficiencies that humans might overcome. (Page 120)

Chapter 5: Aircraft and Airways

  • Thirty-one of the first forty pilots were killed in action, trying to meet the schedules for business and government mail. It may be the first nonmilitary example of a phenomenon that will concern us in this chapter: production pressures in this high-risk system. Though nothing comparable exists today in commercial aviation, such pressures, are, as we shall see, far from negligible. (Page 125)
    • Comment: Production pressure behaves as a global stiffener.
  • Dependence upon multiple function units or subsystems was reduced by the segregation of the traffic (as well as by the use of transponders). More corridors were set up and restricted to certain kinds of flights. Small aircraft with low speeds (and without instrument flying equipment) were excluded from the altitudes where the fast jets fly (though controllers relent on this point, as we have just seen). Military flights were restricted to certain areas; parachutists were controlled. In this way, the system was made more linear. Of course, the density in any one corridor probably also has increased, offsetting the gain to some or even to a great degree. But if we could control for density, we would expect to find a decrease in unexpected interactions. (Page 159)
  • Tight coupling reduces the ability to recover from small failures before they expand into large ones. Loose coupling allows recovery. In ATC processing delays are possible; aircraft are highly maneuverable and in three-dimensional space, so an airplane can be told to hold a pattern, to change course, slow down, speed up, or whatever. The sequence of landing or takeoff or insertion into a long-distance corridor is not invariant, though flexibility here certainly has its limits. The creation of more corridors reduced the coupling as well as the complexity of this system. Time constraints are still tight; the system is not loosely coupled, only moderately tightly coupled. But aircraft are maneuverable. They are also quite small. Near misses generally concern spaces of 200 feet to one mile. Those near misses reported to be under 100 feet are exceedingly rare and the proximity may be exaggerated. Even with 100 feet, though, there are 99 spare feet. If one tried, it would be hard to make two aircraft collide. In other high-risk systems it is comparatively easy to produce an explosion, or to defeat key safety systems and produce a core melt. (Page 160)
  • But in contrast to nuclear power plants and chemical plants (and recombinant DNA research), the system is not a transformation system, with hidden and poorly understood interactions that respond to indirect controls with indirect indicators. (An exception occurs in encountering buffet boundaries.) (Page 167)

Chapter 6: Marine Accidents

  • The problem, it seems to me, lies in the type of system that exists. I will call it an “error-inducing” system; the configuration of its many components induces errors and defeats attempts at error reduction. (Page 172)
    • Comment: I think social networks tend towards being “error-inducing” systems.
  • The notion of an error-inducing system itself is derived from the complexity and coupling concepts. It sees some aspects as too loosely coupled (the insurance subsystem and shippers), others as too tightly coupled (shipboard organization); some aspects as too linear (shipboard organizations again, which are highly centralized and routinized), others as too complexly interactive (supertankers, and also the intricate interactions among marine investigations, courts, insurance agencies, and shippers). (Page 175)
  • One major example, that of what I will call “noncollision course collisions,” will raise a problem we have encountered before, the social construction of reality, or building cognitive models of ambiguous situations. Why do two ships that would have passed in the night suddenly turn on one another and collide? (Page 176)
  • The truckers do not want to use low gears and crawl down the other side of the Donner Summit, they want to go as fast as the turns and the highway patrol will allow them. Granted there are irresponsible drivers (irresponsible to themselves and their families as well as to third-party victims), just as all of us are irresponsible at times. But the work truckers do puts even irresponsible drivers in a situation where irresponsibility will have graver consequences than it does for most of us. Again, it is the system that must be analyzed, not the individuals. How should this system be designed to reduce the probability or limit the consequences of situations where irresponsibility can have an effect? (Page 180)
    • Comment: My sense is that this is almost always a user-interface problem, where the interests of the larger population need to be factored in while still providing meaningful freedom (of choice?) to the individual.
  • The encouragement of risk induces the owners and operators of the system to discard the elements of linearity and loose coupling that do exist, and to increase complexity and tight coupling. To some degree this discarding of intrinsic safety features occurs in all systems, but it appears to me to be far more prevalent in the marine system than in, say, nuclear power production and chemical production. I think the difference lies in the technological and environmental aspects of the system (fewer fixes available and a more severe environment), in its social organization (authority structure), and its catastrophic potential (which is less in the marine case, thus inviting less public intervention). (Page 187)
    • Comment: Each one of these components is a different network of weights connecting the same nodes:
      • Improvisational repairability
      • Environment
      • Social Organization
      • Catastrophic potential
  • We are led back to production pressures again. Even if CAS and radiotelephone communications (and inertial guidance systems, et cetera) would appear to reduce accidents in these studies, they also appear to increase speed and risk-taking, because the accident rate is growing steadily. Production pressures defeat the safety ends of safety devices and increase the pressures to use the devices to reduce operating expenses by going faster, or straighter; this makes the maritime system as a whole more complex (proximity, limited understanding) and tightly coupled (time dependent functions, limited slack). (Page 206)
    • Comment: This is a really important externality that will have a huge impact in robotic systems where there is no need to consider another human. This could easily lead to a runaway condition that is completely invisible until too late. Sort of a more industrial, less dramatic SkyNet example.
  • When we do dumb things in our car occasionally, we get an insight into how deck officers might do the same. Why do we, as drivers, or deck officers on ships, zig when we should have zagged, even when we are attentive and can see? I don’t know the many answers, but the following material will suggest that we construct an expected world because we can’t handle the complexity of the present one, and then process the information that fits the expected world, and find reasons to exclude the information that might contradict it. (Page 213)
    • Comment: Is there neurological support for this?
  • As drivers, we all would probably admit that at times we took unnecessary risks; but what we say to ourselves and others is, “I don’t know why; it was silly, stupid of me.” We generally do not do it because it was exciting. Finally, we cannot rule out exhaustion, or inebriation. Both exist. But neither are mentioned in the accident reports, and more important, from my own experience as a driver, skier, sailor, and climber, I know that I do inexplicable things when I am neither exhausted nor inebriated. So, in conclusion, I am arguing that constructing an expected world, while it begs many questions and leaves many things unexplained, at least challenges the easy explanations such as stupidity, inattention, risk taking, and inexperience. (Page 213)
    • Comment: We do inexplicable things online as well, based on constructed, social reality.
  • But what strikes me about this event is how well it exemplifies the easy way in which we can construct an interpretation of an ambiguous situation, process new information in the light of that interpretation—thus making the situation conform to our expectations—and, when distracted by other duties, make a last minute “correction” that fits with the private reality that no one else shares. (Page 217)
    • Comment: This is why we can have so many histories. They are almost as malleable as fiction, e.g. the Rashomon effect.
  • Why would a ship in a safe passing situation suddenly turn and be impaled by a cargo ship four times its length? For the same reasons the operators of the TMI plant cut back on high pressure injection and uncovered the core. Confronted with ambiguous signals, the safest reality was constructed. In the above marine case this view of reality assumed that the other ship was not a head-on collision threat. (Page 217)
    • Comment: I think that the constructed reality is the one that is most aligned with previous states that takes the least computation to construct. “Things will be fine” is always less computation than planning for catastrophe. And it’s usually the right answer. So why do some people spend their time exploring the space of potential catastrophes? I think this is explore/exploit management at the population level.
  • As complicated as nuclear power and chemical plants are, the complexities are contained within the hardware and the human-machine interface. We may not grasp the functions of superdeheaters, but we know where they are and that they must function on call. In the aircraft and the airways systems, we let more of the environment in. It complicates the situation a bit. The operating envelope of air becomes a problem, as do thunderclouds, windshears, and white-outs. (Page 229)
    • Comment: As long as there are no problems, nuclear reactors are a manufactured social construct, where the environmental reality (the core) is deeply hidden and ignorable. The weather is present and unavoidable. The environment helps to ground the aircraft systems.
  • Marine transport appears to be an error-inducing system, where perverse interconnections defeat safety goals as well as operating efficiencies. Technological improvements did increase output but probably have helped increase accidents; with radar, the ship can go faster; when two ships have radar, they are even more likely to collide. The equivalent of CDTI (cockpit display of traffic information), which is to be introduced into the airliners, already exists at sea, and is useful only a small percent of the time, and may sometimes be counterproductive. Anyway, despite the increasingly sophisticated equipment, captains still inexplicably turn at the last minute and ram each other. We hypothesized that they built perfectly reasonable mental models of the world, which work almost all the time, but occasionally turn out to be almost an inversion of what really exists. The authoritarian structure aboard ship, perhaps functional for simpler times, appears to be inappropriate for complex ships in complex situations. Yet it may be sustained by the shipping industry and the insurance industry who need to determine liability almost as much as they need to stem the increase in accidents. It is reinforced by the “technological fix” which says, “Just give the leader more information, more accurately and faster.” (Page 230)

Chapter 7: Earthbound Systems: Dams, Quakes, Mines and Lakes

  • Overall comment: This chapter covers systems that are embedded in a large, external, mostly observable reality. This does seem to make it more difficult for a complex accident to occur.
  • The U.S. Bureau of Mines now calls for multishaft ventilation and for segmented ventilation; this eliminates many common mode failures where the whole air supply is endangered by a collapse of one ventilation shaft or the failure of one set of fans. (Page 250)
    • Comment: An example of legislated diversity injection. There is no way to know whether a particular ventilation shaft will fail, but the risk of all of them failing is much lower, since they are not meaningfully connected. The legislation doesn’t have to know about mining per se; it can look at human survival needs (exits, food, air, water) and demand that there are redundant, distributed sources for those.
  • Both dams and mining, and certainly the underground mining and surface drilling that produced the Lake Peigneur accident, have alerted us to the possibility of ecosystem accidents—the unanticipated expansion of the system and thus the scope of failures. Systems not thought to be linked suddenly are. (Page 254)
  • Large dams and reservoirs create a complex new environment, and very little is known of the mutual interactions of the component forces, on a long-term basis. (Page 255)
    • Comment: Also applies to global climate forcing
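
The ventilation-shaft comment above rests on a small piece of arithmetic: independent shafts must all fail together before the whole air supply is lost, while a shared cause defeats the redundancy. A minimal sketch of that reasoning follows. The function names and every probability in it are hypothetical, chosen only for illustration, and the shafts are assumed to fail independently of one another.

```python
def p_total_loss_independent(p_shaft_fail: float, n_shafts: int) -> float:
    """Probability that every one of n independent shafts fails together."""
    return p_shaft_fail ** n_shafts


def p_total_loss_with_shared_cause(p_shaft_fail: float, n_shafts: int,
                                   p_shared: float) -> float:
    """Probability of losing all air when the shafts also share one
    common-mode hazard (an assumed example: a single fan house)."""
    independent = p_total_loss_independent(p_shaft_fail, n_shafts)
    # Either the shared cause strikes, or it does not and every shaft
    # happens to fail on its own.
    return p_shared + (1.0 - p_shared) * independent


# Hypothetical per-shaft failure probability, for illustration only.
P_FAIL = 0.01
for shafts in (1, 2, 3):
    print(shafts, "shaft(s):", p_total_loss_independent(P_FAIL, shafts))
print("3 shafts, shared cause:", p_total_loss_with_shared_cause(P_FAIL, 3, 0.005))
```

Under these assumptions, each added independent shaft multiplies the chance of total loss by the per-shaft failure probability, but any shared cause puts a floor under that chance that no amount of extra shafts removes.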

Chapter 8: Exotics: Space, Weapons, and DNA

  • The space missions illustrate that even where the talent and the funds are ample, and errors are likely to be displayed before a huge television audience, system accidents cannot be avoided. I have argued throughout this book that we should give all risky systems more quality control and training than we do, but also that where complexity and coupling lie, it will not be enough. We gave the space missions everything we had, but the system accidents still occurred. This is not a system with catastrophic potential; the victims are first-party victims. Catastrophic potential resides in most, but not all, complex and tightly coupled systems. (Page 257)
  • The centerpiece of this part of the chapter will be an account of an extraordinary mission, Apollo 13, which commenced with a system accident, and ended with a recovery that dramatically illustrates the most exemplary attributes of both humans and their machines. The event will tell us something further about complexity and coupling: the recovery was possible because ground controllers were able to make the system more linear and more loosely coupled, and to put the operators back into the control loop that rarely included them. (Page 257)
  • because of the safety systems involved in a launch-on-warning scenario, it is virtually impossible for well-intended actions to bring about an accidental attack (malevolence or derangement is something else). In one sense this is not all that comforting, since if there were a true warning that the Russian missiles were coming, it looks as if it would also be nearly impossible for there to be an intended launch, so complex and prone to failure is this system. It is an interesting case to reflect upon: at some point does the complexity of a system and its coupling become so enormous that a system no longer exists? Since our ballistic weapons system has never been called upon to perform (it cannot even be tested), we cannot be sure that it really constitutes a viable system. (Page 258)
  • In Chapter 2, I argued that every industrial activity exhibits organizational failures, incompetence, greed, and some criminality. (Page 258)
    • Comment: And this needs to be factored into any design, particularly the design of complex systems.
  • The search for the problem was conducted with the well-worn assumption we have been exploring in this book: since the system is safe, or I wouldn’t be here, it must be a minor problem, or the lesser of two possible evils. Just as reactor cores had never been uncovered before, or that ship would never turn this way, or they would never set my course to hit a mountain, the idea that the heart of the spaceship might be broken was inconceivable, particularly for the managers and designers in Houston, all gathered together for the historic flight. Cooper puts it well: (Page 274)
    • …they felt secure in the knowledge that the spacecraft was as safe a machine for flying to the moon as it was possible to devise. Obviously, men would not be sent into space in anything less, and inasmuch as men were being sent into space, the pressure around NASA to have confidence in the spacecraft was enormous. Everyone placed particular faith in the spacecraft’s redundancy: there were two or more of almost everything.
    • Comment: “…almost..“. The complex behavior forms around the stiff links. This can be modeled, I think.
  • The astronauts felt a jolt and this remained an important part of their analysis throughout the process of trying to track down the problem. (Page 277)
    • Comment: Environmental awareness as another, non-social information channel
  • The accident allows us to review some typical behavior associated with system accidents:
    • initial incomprehension about what was indeed failing;
    • failures are hidden and even masked;
    • a search for a de minimus explanation, since a de maximus one is inconceivable;
    • an attempt to maintain production if at all possible;
    • mistrust of instruments, since they are known to fail;
    • overconfidence in ESDs and redundancies, based upon normal experience of smooth operation in the past;
    • ambiguous information is interpreted in a manner to confirm initial (de minimus) hypotheses;
    • tremendous time constraints, in this case involving not only the propagation of failures, but the expending of vital consumables; and
    • invariant sequences, such as the decision to turn off a subsystem that could not be restarted.
  • All this did not just take place with a few high-school graduates with some drilling in reactor procedures, or a crusty old sea captain isolated in his absolute authority, but happened with three brilliant and extremely well-trained test pilots and a gaggle of managers (scientists and engineers all) backed up by the “Great Designers” themselves, all working shifts in Houston and wired to the spacecraft. (Page 278)
  • In addition, almost every step they devised could be quickly and realistically tested in a very sophisticated simulator before it was tried out in the capsule—something rarely available to high-risk systems. (Page 278)
    • Comment: In addition to diversity, simulators that afford exploring a space should be a requirement of resilient systems
  • With multiple and independent sources of information, the more detailed the information, the more unlikely an error. (Page 289)
    • Comment: This is a key aspect of diversity injection: diverse, easy to test bits of information that build a larger world view. With respect to the mineshaft legislation earlier, it’s easy to see if ventilation is coming from a shaft.
  • With ecosystem accidents the risk cannot be calculated in advance and the initial event—which usually is not even seen as a component failure at all—becomes linked with other systems from which it was believed to be independent. The other systems are not part of any expected production sequence. The linkage is not only unexpected but once it has occurred it is not even well understood or easily traced back to its source. Knowledge of the behavior of the human-made material in its new ecological niche is extremely limited by its very novelty. Ecosystem accidents illustrate the tight coupling between human made systems and natural systems. There are few or no deliberate buffers inserted between the two systems because the designers never expected them to be connected. At its roots, the ecosystem accident is the result of a design error, namely the inadequate definition of system boundaries. (Page 296)

Chapter 9: Living with High-Risk Systems

  • Ultimately, the issue is not risk, but power; the power to impose risks on the many for the benefit of the few. (Page 306)
  • When societies confront a new or explosively growing evil, the number of risk assessors probably grows—whether they are shamans or scientists. (Page 307)
  • heroic efforts would be needed to educate the general public in the skills needed to decide complex issues of risk. At the basis of this is a quarrel about forms of rationality in human affairs. (Page 315)
    • Comment: A good place for games and simulations?
  • It is convenient to think of three forms of rationality: absolute rationality, which is enjoyed primarily by economists and engineers; “bounded” or limited rationality, which a growing wing of risk assessors emphasize; and what I will call social and cultural rationality, which is what most of us live by, although without thinking that much about it. (Page 315)
    • Comment: These map roughly onto my groupings, though “Absolute” would be “environmental”
      • Absolute rationality – Nomadic
      • Bounded rationality – Flocking
      • Socio-cultural rationality – Stampede
  • One important and unintended conclusion that does come from this work is the overriding importance of the context into which the subject puts the problem. Recall our nuclear power operators, or the crew of the New Zealand DC-10 on its sightseeing trip, or the mariners interpreting ambiguous signals. The decisions made in these cases were perfectly rational; it was just that the operators were using the wrong contexts. Selecting a context (“this can happen only with a small pipe break in the secondary system”) is a pre-decision act, made without reflection, almost effortlessly, as a part of a stream of experience and mental processing. We start “thinking” or making “decisions” based upon conscious, rational effort only after the context has become defined. And defining the context is a much more subtle, self-steering process, influenced by long experience with trials and errors (much as the automatic adjustments made in driving or walking on a busy street). If a situation is ambiguous, without thinking about it or deciding upon it, we sometimes pick what seems to be the most familiar context, and only then do we begin to consciously reason. This is what appears to happen in a great many of the psychological experiments. Without conscious thought (the kind that can be easily and fairly accurately recalled), the subject says, “This is like x: I will do what I usually do then.” The results of these experiments strongly suggest that the context supplied by the subjects is not the context the experimenter expected them to supply. With an ill-defined context, the subject of the experiment may say, “Oh, this is like situation A in real life, and this is what I generally do,” while the experimenter thinks that the subject is assuming it is like situation B in real life, and is surprised by what the subject does. (Page 318)
    • Comment: This aligns with ‘Conflict and Consensus’ and is a form of dimension reduction
  • Finally, heuristics are akin to intuitions. Indeed, they might be considered to be regularized, checked-out intuitions. An intuition is a reason, hidden from our consciousness, for certain apparently unrelated things to be connected in a causal way. Experts might be defined as people who abjure intuitions; it is their virtue to have flushed out the hidden causal connections and subjected them to scrutiny and testing, and thus discarded them or verified them. Intuitions, then, are especially unfortunate forms of heuristics, because they are not amenable to inspection. This is why they are so fiercely held even in the face of contrary evidence; the person insists the evidence is irrelevant to their “insight.” (Page 319)
    • Comment: A nice distinction between Nomad and Stampede framings
  • Finally, bounded rationality is efficient because it avoids an extensive amount of effort. For the citizen, think of the work that could be required to decide just what the TMI accident signified. In the experts’ view, the public should make the effort to answer the following questions, and if they can’t, should accept the experts’ answers. Did the accident fit in the technical fault tree analysis the experts constructed in WASH-1400, the Rasmussen Report (a matter of several volumes of technical writing)? How many times in the past have we come close to an accident of this type? Can we correct the system and thus learn from the accident? Was it accurately reported? Do the experts agree on what happened? Did it fit the base rate of events that led to the prediction that it was a very rare event? And so on. The experts do not have answers to some of these questions, so the public, even were they to devote some months of study to the problem, could not be assured that the answer could be known. (Page 320)
    • Comment: Intelligence is computation, and expensive.
  • The most important factor, which they labeled “dread risk,” was associated with (Page 326):
    • lack of control over the activity;
    • fatal consequences if there were a mishap of some sort;
    • high catastrophic potential;
    • reactions of dread;
    • inequitable distribution of risks and benefits (including the transfer of risks to future generations), and
    • the belief that risks are increasing and not easily reducible.
    • Comment: These clusters are a form of dimension reduction that emerges spontaneously in a human population. See “Perceived risk: psychological factors and social implications.”
  • The dimension of dread—lack of control, high fatalities and catastrophic potential, inequitable distribution of risks and benefits, and the sense that these risks are increasing and cannot be easily reduced by technological fixes—clearly was the best predictor of perceived risk. This is what we might call, after Clifford Geertz, a “thick description” of hazards rather than a “thin description.” A thin one is quantitative, precise, logically consistent, economical, and value-free. It embraces many of the virtues of engineering and the physical sciences, and is consistent with what we have called component failure accidents—failures that are predictable and understandable and in an expected production sequence. A thick description recognizes subjective dimensions and cultural values and, in this example, shows a skepticism about human-made systems and institutions, and emphasizes social bonding and the tentative, ambiguous nature of experience. A thick description reflects the nature of system accidents, where unanticipated, unrecognizable interactions of failures occur, and the system does not allow for recovery. (Page 328)
  • But it is fair to ask whether we have progressed enough as a species to handle the more immediate, short-term problems of DNA, chemical plants, nuclear plants, and nuclear weapons. Recall the major thesis of this book: systems that transform potentially explosive or toxic raw materials or that exist in hostile environments appear to require designs that entail a great many interactions which are not visible and in expected production sequence. Since nothing is perfect—neither designs, equipment, operating procedures, operators, materials, and supplies, nor the environment—there will be failures. If the complex interactions defeat designed-in safety devices or go around them, there will be failures that are unexpected and incomprehensible. If the system is also tightly coupled, leaving little time for recovery from failure, little slack in resources or fortuitous safety devices, then the failure cannot be limited to parts or units, but will bring down subsystems or systems. These accidents then are caused initially by component failures, but become accidents rather than incidents because of the nature of the system itself; they are system accidents, and are inevitable, or “normal” for these systems. (Page 330)
  • Accidents will be avoided if the system is also loosely coupled (cell 4, universities and R&D units) because loose coupling gives time, resources, and alternative paths to cope with the disturbance and limits its impact. (Page 331)
  • For organizations that are both linear and loosely coupled (cell 3; most manufacturing, and single goal agencies such as a state liquor authority, vehicle registration, or licensing bureau) centralization is feasible because of the linearity, but decentralization is feasible because of the loose coupling. Thus, these organizations have a choice, insofar as organizational structure affects recovery from inevitable failures. The fact that most have opted for centralized structures says a good bit about the norms of elites who design these systems, and perhaps subtle matters such as the “reproduction of the class system” (keeping people in their place). Organizational theorists generally find such organizations overcentralized and recommend various forms of decentralization for both productivity and social rationality reasons. (Page 334)
    • Comment: Designed-in inefficiency is good for preventing accidents but bad for profit.
  • Organizational theorists have long since given up hope of finding perfect or even exceedingly well-run organizations, even where there is no catastrophic potential. It is an enduring limitation—if it is a limitation—of our human condition. It means that humans do not exist to give their all to organizations run by someone else, and that organizations inevitably will be run, to some degree, contrary to their interests. This is why it is not a problem of “capitalism”; socialist countries, and even the ideal communist system, cannot escape the dilemmas of cooperative, organized effort on any substantial scale and with any substantial complexity and uncertainty. At some point the cost of extracting obedience exceeds the benefits of organized activity. (Page 338)
    • Comment: First point: should we expect any better of machine organizations? Why? Second point: These are curves that cross in a fuzzy way that incorporate short term profit and long term sustainability.
  • What, then, of a less ambitious explanation than capitalism, namely greed (whether due to human nature or structural conditions)—or to put it more analytically, private gain versus the public good? (Page 340)
    • Comment: There is something here. It is not anyone’s private gain, but the gain of the powerful, the amplified. The ability of one person to have the effect of a stampede. In capitalism, this can be achieved using money. In other systems it’s birth, luck, access to power, etc. It can be very good for the few to drive the many over the cliff…
  • Systems that have high status, articulate, and resource-endowed operators are more likely to have the externalities brought to public attention (thus pressuring the system elites) than those with low-status, inarticulate, and impoverished operator groups. To protect their own interests, the operators in the first category can argue the public interest, since both operators and the public will suffer from externalities. The contrast here runs from airline pilots at the favorable extreme, through chemical plant operators, then nuclear plant operators, to miners—the weakest group. (Page 342)
    • Comment: We may only get one 10,000-car pileup. But think about what’s going on in the ML transformation of the justice system
  • The main point of the book is to see these human constructions as systems, not as collections of individuals or representatives of ideologies. From our opening accident with the coffeepot and job interview through the exotics of space, weapons, and microbiology, the theme has been that it is the way the parts fit together, interact, that is important. The dangerous accidents lie in the system, not in the components. The nature of the transformation processes elude the capacities of any human system we can tolerate in the case of nuclear power and weapons; the air transport system works well—diverse interests and technological changes support one another; we may worry much about the DNA system with its unregulated reward structure, less about chemical plants; and though the processes are less difficult and dangerous in mining and marine transport, we find the system of each is an unfortunate concatenation of diverse interests at cross-purposes. (Page 351)
    • Comment: Systems are directed networks. But then ideologies are too. They run at different scales though.
  • These systems are human constructions, whether designed by engineers and corporate presidents, or the result of unplanned, unwitting, crescive, slowly evolving human attempts to cope. (Page 351)
    • Comment: These are the kinds of systems that partially informed agents make. They are mechanisms that populations use to make decisions.
  • The catastrophes send us warning signals. This book has attempted to decode these signals: abandon this, it is beyond your capabilities; redesign this, regardless of short-run costs; regulate this, regardless of the imperfections of regulation. But like the operators of TMI who could not conceive of the worst—and thus could not see the disasters facing them—we have misread these signals too often, reinterpreting them to fit our preconceptions. (Page 351)

Afterword:

  • But a little reflection on Bhopal led me to invert “Normal Accident Theory” in order to see how this tragedy was possible. A few hundred plants with this catastrophic potential could quietly exist for forty years without realizing their potential because it takes just the right combination of circumstances to produce a catastrophe, just as it takes the right combination of inevitable errors to produce an accident. I think this deserves to be called the “Union Carbide factor.” (Page 356)
    • Comment: Power law again?
  • It is hard to have a catastrophe; everything has to come together just right (or wrong). When it does we have “negative synergy.” Since catastrophes are rare, elites, I conclude, feel free to populate the earth with these kinds of risky systems. (Page 358)
    • Comment: There is some level of implicit tradeoff in bringing incompletely understood technologies to market. “Move fast and break things” is an effective way to get past the valley of death as fast as possible (i.e. “Production Pressure”).
  • Both our predictions about the possibilities of accidents and our explanations of them after they occur are profoundly compromised by our act of “social construction.” We do not know what to look for in the first place, and we jump to the most convenient explanation (culture, or bad conditions) in the second place. If an accident has occurred, we will find the most convenient explanations. (Page 359)
    • Comment: “Rush to judgement” has a long history. From the OED – Ld. Erskine Speeches Misc. Subj. (1812) 7 “An attack upon the King is considered to be patricide against the state, and the Jury and the witnesses, and even the Judges, are the children. It is fit, on that account, that there should be a solemn pause before we rush to judgment.”
  • Were we to perform a thought experiment, in which we would go into a plant that was not having an accident but assume there had just been one, we would probably find “an accident waiting to happen” (Page 359)
    • Comment: Pre-mortems are a thing. Should they be used more often?
  • “We might postulate a notion of Organizational Cognitive Complexity,” he writes, “i.e., that the sheer number of nodes and states to be monitored exceeds the organizational capacity to anticipate or comprehend. Complexity Indices for major refinery unit processes varied from 156,000,000 for alkylation/polymerization down to 90 for crude refining. The very use of a logarithmic scale to classify complexity suggests the cognitive challenges facing an organization.” A single component failure thus produces other failures in a (temporarily) incomprehensible manner if the complexity of the system is cognitively challenging. Berniker adds another significant point in his letter: the complexity of the system may not be necessary from a technical point of view, but is mostly the result of poorly planned accretions. (Page 363)
  • More valuable still is Sagan’s development of two organizational concepts that were important for my book: the notion of bounded rationality (the basis of “Garbage Can Theory”), and the notion of organizations as tools to be used for the interests of their masters (the core of a “power theory”). Both garbage can and power theory are noted in passing in my book, but Sagan’s work makes both explicit tools of analysis. Risky systems, I should have stressed more than I did, are likely to have high degrees of uncertainty associated with them and thus one can expect “garbage can” processes, that is, unstable and unclear goals, misunderstanding and mis-learning, happenstance, and confusion as to means. (For the basic work on garbage can theory, see Cohen, March, and Olsen 1988; March and Olsen 1979. For a brief summary, see Perrow 1986, 131–154.) A garbage can approach is appropriate where there is high uncertainty about means and goals and an unstable environment. It invites a pessimistic view of such things as efficiency or commitments to safety. (Page 369)
  • few managers are punished for not putting safety first even after an accident, but will quickly be punished for not putting profits, market share, or agency prestige first. (Page 370)
  • Second, why doesn’t more learning take place? One answer is provided by the title of an interesting piece: “Learning from mistakes is easier said than done …” (Edmonson 1996) in which Edmonson discusses the ample group process barriers to admitting and learning from mistakes. James Reason offers another colorful and more elaborate discussion (1990). Another is that elites do learn, but the wrong things. They learn that disasters are rare and they are not likely to be vulnerable so, in view of the attractions of creating and running risky systems the benefits truly do outweigh the risks for individual calculators. This applies to lowly operators, too; most of the time cutting corners works; only rarely does it “bite back,” in the wonderful terminology that Tenner uses in his delightful book of accident illustrations, though sometimes it will (Tenner 1996). (Page 371)
  • There are important system characteristics that determine the number of inevitable (though often small) failures which might interact unexpectedly, and the degree of coupling which determines the spread and management of the failures. Some of these are: (Page 371)
    • Experience with operating scale—did it grow slowly, accumulating experience (fossil fuel utility plants) or rapidly with no experience of the new configurations or volumes (nuclear power plants)?
    • Experience with the critical phases—if starting and stopping are the risky phases, does this happen frequently (take offs and landings) or infrequently (nuclear plant outages)?
    • Information on errors—is this shared within and between the organizations? (Yes, in the Japanese nuclear power industry; barely in the U.S.) Can it be obtained, as in air transport, or does it go to the bottom with the ship?
    • Close proximity of elites to the operating system—they fly on airplanes but don’t ship on rusty bottoms or live next to chemical plants.
    • Organizational control over members—solid with the naval carriers, partial with nuclear defense, mixed with marine transport (new crews each time), absent with chemical and nuclear plants (and we are hopefully reluctant to militarize all risky systems).
    • Organizational density of the system’s environment (vendors, subcontractors, owner’s/operator’s trade associations and unions, regulators, environmental groups, and competitive systems)—if rich, there will be persistent investigations and less likelihood of blaming God or the operators, and more attempts to increase safety features.
  • When the Exxon Valdez accident occurred Clarke took these insights with him to Alaska, where he came upon a truly mammoth social construction, the several contingency plans put out by the oil industry for cleaning up 80 to 90 percent of any spill, plans that necessitated getting permission to operate in the Sound. The plans were so fanciful that it was hard to take them seriously. There was no case on record of a substantial cleanup of a large spill, yet industry and government regulators promised just that. There was such a disjuncture between what they said they could do and what they actually could do that Clarke saw the plans as symbols rather than blueprints, as “fantasy documents.” (Page 374)
    • Comment: This is an attempt to represent a social reality as a physical one
  • Snook makes another contribution. This is, he says, a normal accident without any failures. How can that be? Well, there were minor deviations in procedures that saved time and effort or corrected some problems without disclosing potential ones. Organizations and large systems probably could not function without the lubricant of minor deviations to handle situations no designer could anticipate. Individually they were inconsequential deviations, and the system had over three years of daily safe operations. But they resulted in the slow, steady uncoupling of local practice—the jets, the AWACS, the army helicopters—from the written procedures, a practice he labels “practical drift.” Local adaptation, he says, can lead to large system disaster. The local adaptations are like a fleet of boats sailing to a destination; there is continuous, intended adjustment to the wind and the movement of nearby boats. The disaster comes when, only from high above the fleet, or from hindsight, we see that the adaptive behavior imperceptibly and unexpectedly result in the boats “drifting” into gridlock, collision, or grounding. (Page 378)
  • Instead, she (Diane Vaughan) builds what would be called a “social construction of reality” case that allowed the banality of bureaucracy to create a habit of normalizing deviations from safe procedures. They all did what they were supposed to do, and were used to doing, and the deviations finally caught up with them. (Page 380)
  • this was not the normalization of deviance or the banality of bureaucratic procedures and hierarchy or the product of an engineering “culture”; it was the exercise of organizational power. We miss a great deal when we substitute culture for power. (Page 380)
    • Comment: One thing that I haven’t really thought about is the role of power in forcing a stampede. There may be herding aspects of a “rush to judgement”
  • New financial instruments such as derivatives and hedge funds and new techniques such as programmed trading further increase the complexity of interactions. Breaking up a loan on a home into tiny packages and selling them on a world-wide basis increases interdependency. (Page 385)
    • Comment: This was written in 1999!
  • But another condition of interdependency is not a bunch of pipes with tight fittings and one direction of flow, but a web of connections wherein some pathways are multipurpose, some might not be used, some reversed, some used in unforeseen ways, some spilling excess capacity, and some buffering and isolating disturbances. This is a more “organic” conception of interdependency, consistent with loose coupling. The distinction is an old one in sociology, going back at least to Emile Durkheim (mechanical and organic solidarity) and Ferdinand Toennies (society versus community) and finding echos in most major sociological thinkers. Predictions follow freely. External threats are supposed to heighten organic solidarity, internal economic downturns, mechanical solidarity. Modernization and bureaucracy find their rationale and techniques in mechanical coupling; communitarian, religious, and cult outbreaks find their rationale in organic coupling and solidarity. (Page 392)
  • This points to the basic difference between 1) software problems—billions of lines of code in forgotten programs in computers that have not been manufactured for ten years—which is bad enough, and 2) the embedded chip problem—billions of chips that are in controllers, rheostats, pumps, valves, regulators, safety systems and so on, which are hard-wired and must be taken out and replaced. (Page 401)

 

Characterizing Online Public Discussions through Patterns of Participant Interactions.

Authors

Overview

An important paper that lays out mechanisms for relating conversations into navigable spaces. To me, this seems like a first step in being able to map human interaction along the dimensions the humans emphasize. In this case, the dimensions have to do with relatively coarse behavior trajectories: Will a participant block another? Will this be a long threaded discussion among a few people or a set of short links all referring to an initial post?

Because the study is rooted in the design affordances of Facebook, the data that are readily available shape the overall design of the methods used. For example, a significant amount of the work is focussed on temporal network analytics. I think that these methods are quite generalizable to sites like Twitter and Reddit. The fact that the researchers worked at Facebook and had easy access to the data is a critical part of this study’s success. For me the implications aren’t that surprising (I found myself saying “Yes! Yes!” several times while reading this), but it is wonderful to see them presented in such a clear, defensible way.

My more theoretical thoughts

Though this study is focussed more on building representations of behaviors, I think that the methods used here (particularly as expanded on in the Future Work section) should be extensible to mapping beliefs.

The extensive discussion about how the design affordances of Facebook create the form of the discussion is also quite validating. Although they don’t mention it, Moscovici lays this concept out in Conflict and Consensus, where he describes how even items such as table shape can change a conversation so that the probability of compromise over consensus is increased.

Lastly, I’m really looking forward to checking out the Cornell Conversational Analysis Toolkit, developed for(?) this study.

Notes

  • This paper introduces a computational framework to characterize public discussions, relying on a representation that captures a broad set of social patterns which emerge from the interactions between interlocutors, comments and audience reactions. (Page 198:1)
  • we use it to predict the eventual trajectory of individual discussions, anticipating future antisocial actions (such as participants blocking each other) and forecasting a discussion’s growth (Page 198:1)
  • platform maintainers may wish to identify salient properties of a discussion that signal particular outcomes such as sustained participation [9] or future antisocial actions [16], or that reflect particular dynamics such as controversy [24] or deliberation [29]. (Page 198:1)
  • Systems supporting online public discussions have affordances that distinguish them from other forms of online communication. Anybody can start a new discussion in response to a piece of content, or join an existing discussion at any time and at any depth. Beyond textual replies, interactions can also occur via reactions such as likes or votes, engaging a much broader audience beyond the interlocutors actively writing comments. (Page 198:2)
    • This is why JuryRoom would be distinctly different. Its unique affordances should create unique, hopefully clearer results.
  • This multivalent action space gives rise to salient patterns of interactional structure: they reflect important social attributes of a discussion, and define axes along which discussions vary in interpretable and consequential ways. (Page 198:2)
  • Our approach is to construct a representation of discussion structure that explicitly captures the connections fostered among interlocutors, their comments and their reactions in a public discussion setting. We devise a computational method to extract a diverse range of salient interactional patterns from this representation—including but not limited to the ones explored in previous work—without the need to predefine them. We use this general framework to structure the variation of public discussions, and to address two consequential tasks predicting a discussion’s future trajectory: (a) a new task aiming to determine if a discussion will be followed by antisocial events, such as the participants blocking each other, and (b) an existing task aiming to forecast the growth of a discussion [9]. (Page 198:2)
  • We find that the features our framework derives are more informative in forecasting future events in a discussion than those based on the discussion’s volume, on its reply structure and on the text of its comments (Page 198:2)
  • we find that mainstream print media (e.g., The New York Times, The Guardian, Le Monde, La Repubblica) is separable from cable news channels (e.g., CNN, Fox News) and overtly partisan outlets (e.g., Breitbart, Sean Hannity, Robert Reich) on the sole basis of the structure of the discussions they trigger (Figure 4). (Page 198:2)
  • figure-4
  • These studies collectively suggest that across the broader online landscape, discussions take on multiple types and occupy a space parameterized by a diversity of axes—an intuition reinforced by the wide range of ways in which people engage with social media platforms such as Facebook [25]. With this in mind, our work considers the complementary objective of exploring and understanding the different types of discussions that arise in an online public space, without predefining the axes of variation. (Page 198:3)
  • Many previous studies have sought to predict a discussion’s eventual volume of comments with features derived from their content and structure, as well as exogenous information [8, 9, 30, 69, inter alia]. (Page 198:3)
  • Many such studies operate on the reply-tree structure induced by how successive comments reply to earlier ones in a discussion rooted in some initial content. Starting from the reply-tree view, these studies seek to identify and analyze salient features that parameterize discussions on platforms like Reddit and Twitter, including comment popularity [72], temporal novelty [39], root-bias [28], reply-depth [41, 50] and reciprocity [6]. Other work has taken a linear view of discussions as chronologically ordered comment sequences, examining properties such as the arrival sequence of successive commenters [9] or the extent to which commenters quote previous contributions [58]. The representation we introduce extends the reply-tree view of comment-to-comment. (Page 198:3)
  • Our present approach focuses on representing a discussion on the basis of its structural rather than linguistic attributes; as such, we offer a coarser view of the actions taken by discussion participants that more broadly captures the nature of their contributions across contexts which potentially exhibit large linguistic variation. (Page 198:4)
  • This representation extends previous computational approaches that model the relationships between individual comments, and more thoroughly accounts for aspects of the interaction that arise from the specific affordances offered in public discussion venues, such as the ability to react to content without commenting. Next, we develop a method to systematically derive features from this representation, hence producing an encoding of the discussion that reflects the interaction patterns encapsulated within the representation, and that can be used in further analyses. (Page 198:4)
  • In this way, discussions are modelled as collections of comments that are connected by the replies occurring amongst them. Interpretable properties of the discussion can then be systematically derived by quantifying structural properties of the underlying graph: for instance, the indegree of a node signifies the propensity of a comment to draw replies. (Page 198:5)
    • Quick responses that reflect a high degree of correlation would be tight. A long-delayed “like” could be slack?
  • For instance, different interlocutors may exhibit varying levels of engagement or reciprocity. Activity could be skewed towards one particularly talkative participant or balanced across several equally-prolific contributors, as can the volume of responses each participant receives across the many comments they may author. (Page 198: 5)
  • We model this actor-focused view of discussions with a graph-based representation that augments the reply-tree model with an additional superstructure. To aid our following explanation, we depict the representation of an example discussion thread in Figure 1 (Page 198: 6)
  • fig1table1
  • Relationships between actors are modeled as the collection of individual responses they exchange. Our representation reflects this by organizing edges into hyperedges: a hyperedge between a hypernode C and a node c′ contains all responses an actor directed at a specific comment, while a hyperedge between two hypernodes C and C′ contains the responses that actor C directed at any comment made by C′ over the entire discussion. (Page 198: 6)
    • I think that this can be represented as a tensor (hyperdimensional or flattened) with each node having a value if there is an intersection. There may be an overall scalar that allows each type of interaction to be adjusted as a whole.
  • The mixture of roles within one discussion varies across different discussions in intuitively meaningful ways. For instance, some discussions are skewed by one particularly active participant, while others may be balanced between two similarly-active participants who are perhaps equally invested in the discussion. We quantify these dynamics by taking several summary statistics of each in/outdegree distribution in the hypergraph representation, such as their maximum, mean and entropy, producing aggregate characterizations of these properties over an entire discussion. We list all statistics computed in the appendices (Table 4). (Page 198: 6, 7)
  • table4
  • To interpret the structure our model offers and address potentially correlated or spurious features, we can perform dimensionality reduction on the feature set our framework yields. In particular, let $X$ be a $N \times k$ matrix whose $N$ rows each correspond to a thread represented by $k$ features. We perform a singular value decomposition on $X$ to obtain a $d$-dimensional representation $X \approx \hat{X} = U S V^{T}$ where rows of $U$ are embeddings of threads in the induced latent space and rows of $V$ represent the hypergraph-derived features. (Page 198: 9)
    • This lets us find the hyperplane of the map we want to build
  • Community-level embeddings. We can naturally extend our method to characterize online discussion communities—interchangeably, discussion venues—such as Facebook Pages. To this end, we aggregate representations of the collection of discussions taking place in a community, hence providing a representation of communities in terms of the discussions they foster. This higher level of aggregation lends further interpretability to the hypergraph features we derive. In particular, we define the embedding $\bar{U}_C$ of a community $C$ containing threads $\{t_1, t_2, \ldots, t_n\}$ as the average of the corresponding thread embeddings $U_{t_1}, U_{t_2}, \ldots, U_{t_n}$, scaled to unit $\ell_2$ norm. Two communities $C_1$ and $C_2$ that foster structurally similar discussions then have embeddings $\bar{U}_{C_1}$ and $\bar{U}_{C_2}$ that are close in the latent space. (Page 198: 9)
    • And this may let us place small maps in a larger map. Not sure if the dimensions will line up though
  • The set of threads to a post may be algorithmically re-ordered based on factors like quality [13]. However, subsequent replies within a thread are always listed chronologically. We address elements of such algorithmic ranking effects in our prediction tasks (§5). (Page 198: 10)
  • Taken together, these filtering criteria yield a dataset of 929,041 discussion threads. (Page 198: 10)
  • We now apply our framework to forecast a discussion’s trajectory—can interactional patterns signal future thread growth or predict future antisocial actions? We address this question by using the features our method extracts from the 10-comment prefix to predict two sets of outcomes that occur temporally after this prefix. (Page 198:10)
    • These are behavioral trajectories, though not belief trajectories. Maps of these behaviors could probably be built, too.
  • For instance, news articles on controversial issues may be especially susceptible to contentious discussions, but this should not translate to barring discussions about controversial topics outright. Additionally, in large-scale social media settings such as Facebook, the content spurring discussions can vary substantially across different sub-communities, motivating the need to seek adaptable indicators that do not hinge on content specific to a particular context. (Page 198: 11)
  • Classification protocol. For each task, we train logistic regression classifiers that use our full set of hypergraph-derived features, grid-searching over hyperparameters with 5-fold cross-validation and enforcing that no Page spans multiple folds (footnote 13). We evaluate our models on a (completely fresh) heldout set of thread pairs drawn from the subsequent week of data (Nov. 8-14, 2017), addressing a model’s potential dependence on various evolving interface features that may have been deployed by Facebook during the time spanned by the training data. (Page 198: 11)
    • We use logistic regression classifiers from scikit-learn with l2 loss, standardizing features and grid-searching over C = {0.001, 0.01, 1}. In the bag-of-words models, we tf-idf transform features, set a vocabulary size of 5,000 words and additionally grid-search over the maximum document frequency in {0.25, 0.5, 1}. (Page 198: 11, footnote 13)
  • We test a model using the temporal rate of commenting, which was shown to be a much stronger signal of thread growth than the structural properties considered in prior work [9] (Page 198: 12)
  • Table 3 shows Page-macroaveraged heldout accuracies for our prediction tasks. The feature set we extract from our hypergraph significantly outperforms all of the baselines in each task. This shows that interactional patterns occurring within a thread’s early activity can signal later events, and that our framework can extract socially and structurally-meaningful patterns that are informative beyond coarse counts of activity volume, the reply-tree alone and the order in which commenters contribute, along with a shallow representation of the linguistic content discussed. (Page 198: 12)
    • So triangulation from a variety of data sources produces more accurate results in this context, and probably others. Not a surprising finding, but important to show
  • table3
  • We find that in almost all cases, our full model significantly outperforms each subcomponent considered, suggesting that different parts of the hypergraph framework add complementary information across these tasks. (Page 198: 13)
  • Having shown that our approach can extract interaction patterns of practical importance from individual threads, we now apply our framework to explore the space of public discussions occurring on Facebook. In particular, we identify salient axes along which discussions vary by qualitatively examining the latent space induced from the embedding procedure described in §3, with d = 7 dimensions. Using our methodology, we recover intuitive types of discussions, which additionally reflect our priors about the venues which foster them. This analysis provides one possible view of the rich landscape of public discussions and shows that our thread representation can structure this diverse space of discussions in meaningful ways. This procedure could serve as a starting point for developing taxonomies of discussions that address the wealth of structural interaction patterns they contain, and could enrich characterizations of communities to systematically account for the types of discussions they foster. (Page 198: 14) 
    • ^^^Show this to Wayne!^^^
  • The emergence of these groupings is especially striking since our framework considers just discussion structure without explicitly encoding for linguistic, topical or demographic data. In fact, the groupings produced often span multiple languages—the cluster of mainstream news sites at the top includes French (Le Monde), Italian (La Repubblica) and German (SPIEGEL ONLINE) outlets; the “sports” region includes French (L’EQUIPE) as well as English outlets. This suggests that different types of content and different discussion venues exhibit distinctive interactional signatures, beyond lexical traits. Indeed, an interesting avenue of future work could further study the relation between these factors and the structural patterns addressed in our approach, or augment our thread representation with additional contextual information. (Page 198: 15)
  • Taken together, we can use the features, threads and Pages which are relatively salient in a dimension to characterize a type of discussion. (Page 198: 15)
  • To underline this finer granularity, for each examined dimension we refer to example discussion threads drawn from a single Page, The New York Times (https://www.facebook.com/nytimes), which are listed in the footnotes. (Page 198: 15)
    • Common starting point. Do they find consensus, or show how the dimensions reduce?
  • Focused threads tend to contain a small number of active participants replying to a large proportion of preceding comments; expansionary threads are characterized by many less-active participants concentrating their responses on a single comment, likely the initial one. We see that (somewhat counterintuitively) meme-sharing discussion venues tend to have relatively focused discussions. (Page 198: 15)
    • These are two sides of the same dimension-reduction coin. A focused thread should be using the dimension-reduction tool of open discussion that requires the participants to agree on what they are discussing. As such it refines ideas and would produce more meme-compatible content. Expansive threads are dimension reducing to the initial post. The subsequent responses go in too many directions to become a discussion.
  • Threads at one end (blue) have highly reciprocal dyadic relationships in which both reactions and replies are exchanged. Since reactions on Facebook are largely positive, this suggests an actively supportive dynamic between actors sharing a viewpoint, and tend to occur in lifestyle-themed content aggregation sub-communities as well as in highly partisan sites which may embody a cohesive ideology. In threads at the other end (red), later commenters tend to receive more reactions than the initiator and also contribute more responses. Inspecting representative threads suggests this bottom-heavy structure may signal a correctional dynamic where late arrivals who refute an unpopular initiator are comparatively well-received. (Page 198: 17)
  • This contrast reflects an intuitive dichotomy of one- versus multi-sided discussions; interestingly, the imbalanced one-sided discussions tend to occur in relatively partisan venues, while multi-sided discussions often occur in sports sites (perhaps reflecting the diversity of teams endorsed in these sub-communities). (Page 198: 17)
    • This means that we can identify one-sided behavior and then use that to look at the underlying information. No need to look in diverse areas, they are taking care of themselves. This is ecosystem management 101, where things like algae blooms and invasive species need to be recognized and then managed.
  • We now seek to contrast the relative salience of these factors after controlling for community: given a particular discussion venue, is the content or the commenter more responsible for the nature of the ensuing discussions? (Page 198: 17)
  • This suggests that, perhaps somewhat surprisingly, the commenter is a stronger driver of discussion type. (Page 198: 18)
    • I can see that. The initial commenter is kind of a gate-keeper to the discussion. A low-dimension, incendiary comment that is already aligned with one group (“lock her up”), will create one kind of discussion, while a high-dimensional, nuanced post will create another.
  • We provide a preliminary example of how signals derived from discussion structure could be applied to forecast blocking actions, which are potential symptoms of low-quality interactions (Page 198: 18)
  • The nature of the discussion may also be shaped by the structure of the underlying social network, such that interactions between friends proceed in contrasting ways from interactions between complete strangers.  (Page 198: 19)
    • Yep, design matters. Diversity injection matters.
  • For instance, as with the bulk of other computational studies, our work relies heavily on indicators of interactional dynamics which are easily extracted from the data, such as replies or blocks. Such readily available indicators can at best only approximate the rich space of participant experiences, and serve as very coarse proxies for interactional processes such as breakdown or repair [27, 62]. As such, our implicit preference for computational expedience limits the granularity and nuance of our analyses. (Page 198: 20)
    • Another argument for funding a platform that is designed to provide these nuances
  • One possible means of enriching our model to address this limitation could be to treat nodes as high-dimensional vectors, such that subsequent responses only act on a subset of these dimensions. (Page 198: 21)
    • Agreed. A set of matrices that represent an aspect of each node should have a rich set of capabilities
  • Accounting for linguistic features of the replies within a discussion necessitates vastly enriching the response types presently considered, perhaps through a model that represents the corresponding edges as higher-dimensional vectors rather than as discrete types. Additionally, linguistic features might identify replies that address multiple preceding comments or a small subset of ideas within the target(s) of the reply, offering another route to move beyond the atomicity of comments assumed by our present framework. (Page 198: 21)
    • Exactly right. High-dimensional representations that can then be analyzed to uncover the implicit dimensions of interaction are the way to go, I think.
  • Important references

Similar neural responses predict friendship

Authors and related work

Overview

A detailed, lay overview has been written up in the New York Times: You Share Everything With Your Bestie. Even Brain Waves.

The study took a cohort (N = 279) of students in a graduate program. Students were asked to list who their friends were, and a social network was constructed from those lists. A subset (N = 42) of these students was then asked to watch a series of videos while their brains were monitored by an fMRI machine. The timings of brain activations across 80 regions of the brain were compared to see if there were similarities that correlated with social distance. Statistically significant similarities exist such that friends could be identified by firing patterns and timing. In particular, individuals with one degree of separation were strongly resonant(?), while individuals with three or more degrees of separation could not be discriminated by fMRI.
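
To make “degrees of separation” concrete, here is a minimal Python sketch of how social distance could be computed from survey answers. It assumes, as the paper excerpt in the Notes says, that only mutually reported ties count as friendships. The function names (`reciprocal_graph`, `social_distance`) and the toy survey are my own illustrations, not the authors' code or data.

```python
from collections import deque

def reciprocal_graph(nominations):
    """Keep only ties that both people reported (a -> b and b -> a)."""
    graph = {person: set() for person in nominations}
    for person, friends in nominations.items():
        for friend in friends:
            if person in nominations.get(friend, ()):
                graph[person].add(friend)
    return graph

def social_distance(graph, start, goal):
    """Geodesic distance (number of ties) by breadth-first search; None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph[node]:
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

# Toy survey with made-up names (hypothetical, not data from the study).
nominations = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "cat"],
    "cat": ["bob", "dan"],
    "dan": ["cat"],
    "eve": ["ann"],  # unreciprocated, so eve ends up with no ties
}
graph = reciprocal_graph(nominations)
for other in ("bob", "cat", "dan", "eve"):
    print("ann ->", other, social_distance(graph, "ann", other))
```

In this toy network ann names cat, but cat does not name ann back, so the two are connected only through bob. That is the effect of the reciprocity filter: dropping one-sided nominations can only lengthen distances or disconnect people.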

My more theoretical thoughts:

This is more support for the idea that groups of people “flock” in latent belief space. If everyone fired in the same way to the videos, then the environmental influence would have been dominant – a video of a sloth or a volcano is “objectively” interpreted across a population. Instead, the interpretation of the videos is clustered around individuals with high levels of social connection. Humans spontaneously form groups of preferred sizes organized in a geometrical series approximating 3–5, 9–15, 30–45, etc. This is remarkably similar to the numbers found in social organizations such as flocks of starlings (seven). As we’ve seen in multiple studies, a certain amount of social cohesion is beneficial as a way of finding resources in a noisy environment (Grunbaum), so this implies that belief space is noisy, but that beneficial beliefs can be found using similar means. Grunbaum also finds that excessive social cohesion (stampedes) decreases the ability to find resources. Determining the balance of explore/exploit with respect to depending on your neighbors/friends is uncomputable, but exploration is computationally more expensive than exploitation, so the pressure is always towards some level of stampede.

This means that in physical and belief spaces, the density and stiffness of connections control the behavior of the social network. Adjusting the dial on the similarity aspect (increasing/decreasing stiffness of the links) should result in nomadic, flocking and stampeding behavior in belief space.
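
A toy Python sketch of the “stiffness dial”, under heavy assumptions: each agent is just a unit heading in a 5-dimensional belief space, every agent is linked to every other, and stiffness is a blend weight between an agent's own heading and the group mean. This is my own simplification for illustration. It is not Grunbaum's model and not anything from the fMRI paper, and all the names (`step`, `alignment`, `stiffness`) are hypothetical.

```python
import math
import random

def mean_vector(vectors):
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def step(headings, stiffness):
    """Pull every agent's heading toward the group mean by `stiffness` in [0, 1]."""
    mean = mean_vector(headings)
    return [
        normalize([(1 - stiffness) * h[d] + stiffness * mean[d] for d in range(len(h))])
        for h in headings
    ]

def alignment(headings):
    """Length of the mean heading: near 0 = scattered, 1 = everyone heading the same way."""
    return math.sqrt(sum(x * x for x in mean_vector(headings)))

random.seed(0)
# 50 agents with random unit headings (assumed sizes, chosen only for the demo).
agents = [normalize([random.gauss(0, 1) for _ in range(5)]) for _ in range(50)]

for stiffness in (0.0, 0.5, 1.0):
    print(stiffness, round(alignment(step(agents, stiffness)), 3))
```

At stiffness 0 the headings are left as they were (the nomadic end of the dial), and at stiffness 1 every heading collapses onto the group mean direction in a single step (the stampede end). Anything in between only nudges agents toward the mean. Whether that middle range behaves like flocking in a real belief space is exactly the open question, so treat this as a way to poke at the idea rather than as evidence for it.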

Notes

  • Research has borne out this intuition: social ties are forged at a higher-than-expected rate between individuals of the same age, gender, ethnicity, and other demographic categories. This assortativity in friendship networks is referred to as homophily and has been demonstrated across diverse contexts and geographic locations, including online social networks [2, 3, 4, 5]. (Page 2)
  • When humans do forge ties with individuals who are dissimilar from themselves, these relationships tend to be instrumental, task-oriented (e.g., professional collaborations involving people with complementary skill sets [7]), and short-lived, often dissolving after the individuals involved have achieved their shared goal. Thus, human social networks tend to be overwhelmingly homophilous [8]. (Page 2)
    • This means that groups can be more efficient, but prone to belief stampede
  • Remarkably, social network proximity is as important as genetic relatedness and more important than geographic proximity in predicting the similarity of two individuals’ cooperative behavioral tendencies [4] (Page 2)
  • how individuals interpret and respond to their environment increases the predictability of one another’s thoughts and actions during social interactions [14], since knowledge about oneself is a more valid source of information about similar others than about dissimilar others. (Page 2)
    • There is a second layer on top of this which may be more important. How individuals respond to social cues (which can have significant survival value in a social animal) may be more important than day-to-day reactions to the physical environment.
  • Here we tested the proposition that neural responses to naturalistic audiovisual stimuli are more similar among friends than among individuals who are farther removed from one another in a real-world social network. Measuring neural activity while people view naturalistic stimuli, such as movie clips, offers an unobtrusive window into individuals’ unconstrained thought processes as they unfold [16]. (Page 2)
  • Social network proximity appears to be significantly associated with neural response similarity in brain regions involved in attentional allocation, narrative interpretation, and affective responding (Page 2)
  • We first characterized the social network of an entire cohort of students in a graduate program. All students (N = 279) in the graduate program completed an online survey in which they indicated the individuals in the program with whom they were friends (see Methods for further details). Given that a mutually reported tie is a stronger indicator of the presence of a friendship than an unreciprocated tie, a graph consisting only of reciprocal (i.e., mutually reported) social ties was used to estimate social distances between individuals. (Page 2)
    • I wonder if this changes as people age. Are there gender differences?
  • The videos presented in the fMRI study covered a range of topics and genres (e.g., comedy clips, documentaries, and debates) that were selected so that they would likely be unfamiliar to subjects, effectively constrain subjects’ thoughts and attention to the experiment (to minimize mind wandering), and evoke meaningful variability in responses across subjects (because different subjects attend to different aspects of them, have different emotional reactions to them, or interpret the content differently, for example). (Page 3)
    • I think this might make the influence more environmental than social. It would be interesting to see how a strongly aligned group would deal with a polarizing topic, even something like sports.
  • Mean response time series spanning the course of the entire experiment were extracted from 80 anatomical regions of interest (ROIs) for each of the 42 fMRI study subjects. (Page 3)
    • 80 possible dimensions. It would be interesting to see this in latent space. That being said, there is no dialog here, so no consensus building, which implies no dimension reduction.
  • To test for a relationship between fMRI response similarity and social distance, a dyad-level regression model was used. Models were specified either as ordered logistic regressions with categorical social distance as the dependent variable or as logistic regression with a binary indicator of reciprocated friendship as the dependent variable. We account for the dependence structure of the dyadic data (i.e., the fact that each fMRI subject is involved in multiple dyads), which would otherwise underestimate the standard errors and increase the risk of type 1 error [20], by clustering simultaneously on both members of each dyad [21, 22].
  • For the purpose of testing the general hypothesis that social network proximity is associated with more similar neural responses to naturalistic stimuli, our main predictor variable of interest, neural response similarity within each student dyad, was summarized as a single variable. Specifically, for each dyad, a weighted average of normalized neural response similarities was computed, with the contribution of each brain region weighted by its average volume in our sample of fMRI subjects. (Page 3)
  • To account for demographic differences that might impact social network structure, our model also included binary predictor variables indicating whether subjects in each dyad were of the same or different nationalities, ethnicities, and genders, as well as a variable indicating the age difference between members of each dyad. In addition, a binary variable was included indicating whether subjects were the same or different in terms of handedness, given that this may be related to differences in brain functional organization [23]. (Page 3)
  • Logistic regressions that combined all non-friends into a single category, regardless of social distance, yielded similar results, such that neural similarity was associated with a dramatically increased likelihood of friendship, even after accounting for similarities in observed demographic variables. More specifically, a one SD increase in overall neural similarity was associated with a 47% increase in the likelihood of friendship (logistic regression: β = 0.388; SE = 0.109; p = 0.0004; N = 861 dyads). Again, neural similarity improved the model’s predictive power above and beyond observed demographic similarities, χ²(1) = 7.36, p = 0.006. (Page 4)
  • To gain insight into what brain regions may be driving the relationship between social distance and overall neural similarity, we performed ordered logistic regression analyses analogous to those described above independently for each of the 80 ROIs, again using cluster-robust standard errors to account for dyadic dependencies in the data. This approach is analogous to common fMRI analysis approaches in which regressions are carried out independently at each voxel in the brain, followed by correction for multiple comparisons across voxels. We employed false discovery rate (FDR) correction to correct for multiple comparisons across brain regions. This analysis indicated that neural similarity was associated with social network proximity in regions of the ventral and dorsal striatum … Regression coefficients for each ROI are shown in Fig. 6, and further details for ROIs that met the significance threshold of p < 0.05, FDR-corrected (two tailed) are provided in Table 2. (Page 4)
    • So the latent space that matters involves something on the order of 7 – 9 regions? I wonder if the actions across regions are similar enough to reduce further. I need to look up what each region does.
  • Table 2; Fig. 6
  • Results indicated that average overall (weighted average) neural similarities were significantly higher among distance 1 dyads than dyads belonging to other social distance categories … distance 4 dyads were not significantly different in overall neural response similarity from dyads in the other social distance categories. All reported p-values are two-tailed. (Page 4)
  • Within the training data set for each data fold, a grid search procedure [24] was used to select the C parameter of a linear support vector machine (SVM) learning algorithm that would best separate dyads according to social distance. (Page 5)
  • As shown in Fig. 8, the classifier tended to predict the correct social distances for dyads in all distance categories at rates above the accuracy level that would be expected based on chance alone (i.e., 25% correct), with an overall classification accuracy of 41.25%. Classification accuracies for distance 1, 2, 3, and 4 dyads were 48%, 39%, 31%, and 47% correct, respectively. (Page 6)
  • where the classifier assigned the incorrect social distance label to a dyad, it tended to be only one level of social distance away from the correct answer: when friends were misclassified, they were misclassified most often as distance 2 dyads; when distance 2 dyads were misclassified, they were misclassified most often as distance 1 or 3 dyads, and so on. (Page 6)
  • The results reported here are consistent with neural homophily: people tend to be friends with individuals who see the world in a similar way. (Page 7)
  • Brain areas where response similarity was associated with social network proximity included subcortical areas implicated in motivation, learning, affective processing, and integrating information into memory, such as the nucleus accumbens, amygdala, putamen, and caudate nucleus [27, 28, 29]. Social network proximity was also associated with neural response similarity within areas involved in attentional allocation, such as the right superior parietal cortex [30, 31], and regions in the inferior parietal lobe, such as the bilateral supramarginal gyri and left inferior parietal cortex (which includes the angular gyrus in the parcellation scheme used [32]), that have been implicated in bottom-up attentional control, discerning others’ mental states, processing language and the narrative content of stories, and sense-making more generally [33, 34, 35]. (Page 7)
  • However, the current results suggest that social network proximity may be associated with similarities in how individuals attend to, interpret, and emotionally react to the world around them. (Page 7)
    • Both the environmental and social world
  • A second, not mutually exclusive, possibility pertains to the “three degrees of influence rule” that governs the spread of a wide range of phenomena in human social networks [43]. Data from large-scale observational studies as well as lab-based experiments suggest that wide-ranging phenomena (e.g., obesity, cooperation, smoking, and depression) spread only up to three degrees of geodesic distance in social networks, perhaps due to social influence effects decaying with social distance to the extent that they are undetectable at social distances exceeding three, or to the relative instability of long chains of social ties [43]. Although we make no claims regarding the causal mechanisms behind our findings, our results show a similar pattern. (Page 8)
    • Does this change with the level of similarity in the group?
  • pre-existing similarities in how individuals tend to perceive, interpret, and respond to their environment can enhance social interactions and increase the probability of developing a friendship via positive affective processes and by increasing the ease and clarity of communication [14, 15]. (Page 8)
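
A quick arithmetic check on two of the figures quoted in the notes above: the logistic regression result from Page 4 and the classifier accuracies from Page 6. The only inputs are numbers already in the quotes. The normal (Wald) approximation for the p-value is my assumption, since the paper reports cluster-robust standard errors and the excerpt does not say how p was computed.

```python
import math

beta, se = 0.388, 0.109            # coefficient and standard error quoted from Page 4

odds_ratio = math.exp(beta)        # multiplicative change in odds per 1 SD of similarity
z = beta / se                      # Wald statistic (assumed test, see note above)
p_two_tailed = math.erfc(z / math.sqrt(2))  # two-tailed p under a standard normal

class_accuracy = [48, 39, 31, 47]  # distance 1..4 dyads, percent correct (Page 6)
mean_accuracy = sum(class_accuracy) / len(class_accuracy)

print(f"odds ratio = {odds_ratio:.3f}")
print(f"Wald z = {z:.2f}, two-tailed p = {p_two_tailed:.5f}")
print(f"mean per-class accuracy = {mean_accuracy:.2f}%")
```

The odds ratio comes out at about 1.47, so the “47% increase in the likelihood of friendship” is, strictly speaking, a 47% increase in the odds. The p-value lands close to the reported 0.0004. The unweighted mean of the four per-class accuracies is 41.25%, the same as the reported overall accuracy, which is what I would expect if the four distance categories were equally represented in the test folds.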
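
The Page 4 excerpt above says the 80 per-ROI regressions were corrected with a false discovery rate (FDR) procedure but, at least in what I quoted, not which one. Below is a minimal sketch assuming the common Benjamini-Hochberg step-up procedure. The function name and the p-values are made up for illustration and are not taken from the paper's Table 2.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans: True where the hypothesis is rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Everything at or below that rank is rejected, even if it missed its own threshold.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values for eight ROIs (invented numbers).
roi_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print(benjamini_hochberg(roi_p))
```

With these invented numbers only the first two ROIs survive, even though five of the eight have uncorrected p < 0.05. That is the kind of shrinkage that would take 80 candidate regions down to the handful listed as significant.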