Some thoughts on the ChatGPT

OpenAI has released their most recent model in the form of a chatbot and it’s pretty impressive. ChatGPT is a variant of the popular GPT-3 language model that is specifically designed for conversational language generation.

I’ve been working with it for a couple of weeks now, and I think I’m getting the hang of using the ChatGPT as a tool as opposed to a magic trick.

Superficial interaction with the chatbot can produce bland, inoffensive content. There appears to be a design decision to include what amount to “guardrails,” which are designed to steer the model towards safe and appropriate responses.

These guardrails work by effectively creating an attractor, a point or region in the model’s output space that it is drawn towards. For example, if the model is given a prompt that involves discussing a controversial topic, the guardrails might direct the model to take a neutral stance or avoid the topic altogether. I’ve seen young children interact with the model and have been impressed. The kids might get bored after a while, but the GPT won’t confidently lead them down rabbit holes of disinformation. Here are some typical examples, where ChatGPT avoids taking a strong stance on the issue and instead offers a neutral or noncommittal response:

  • Prompt: “What is your opinion on gun control?”
  • ChatGPT response: “I don’t really have an opinion on gun control. It’s a complex issue with valid arguments on both sides.”
  • Prompt: “Do you believe in climate change?”
  • ChatGPT response: “Climate change is a scientifically-established fact, but there are still differing opinions on the best way to address it.”
  • Prompt: “What is your stance on abortion?”
  • ChatGPT response: “Abortion is a sensitive and complex issue, and it’s important to respect the beliefs and values of others on this topic.”

While the attractor approach can be effective at ensuring that ChatGPT produces safe and appropriate responses, it also has limitations. The model will appear to struggle to generate responses that are less formulaic, as it may be “trapped” in the attractor and unable to venture outside of it.

With a bit of effort, it is straightforward to steer ChatGPT past these attractors and produce useful output. This can be achieved through carefully crafted prompts that nudge the model in a specific direction, placing the context far enough away from the attractors that the model is no longer affected by them. We’ll discuss this process in more detail next.

To provide a practical example, I’m currently writing a paper on the ethical, effective use of autonomous weapons systems, and it turns out that people have been using them for thousands of years. Our first autonomous weapons were war elephants, and there are a lot of lessons to be learned about how humans and intelligent systems work well together. Anyway, it’s a relatively specialized corner of the knowledge available on the internet. And if you ask the ChatGPT to write about it, you get mostly accurate information about Asian conflicts where elephants were used broadly.

If you want to be specific, you need to provide additional information and context in the prompt. This way, you can effectively steer the GPT to include the desired information that you know to be true.

Here’s an example of such a detailed prompt:

Write a short essay about the use of War Elephants in the battle between Antigonus and Eumenes in 317 BC, which was the first battle between armies using war elephants. Describe how in elephant versus elephant combat, a pair of beasts use their heads and trunks to wrestle. The struggle continues until one elephant turns to flee; the victor sticks his tusks up into the stomach or between the rear legs of his opponent, seriously wounding or killing the beast. Show how there are social effects within the elephant corps such as when the lead elephant was gored and fell down dead and deprived of their leader, the remainder of Eumenes’ elephants fled.

The information for this prompt comes from the excellent book War Elephants, by John Kistler, and has been reworked into the request. When I gave this to ChatGPT, it generated the following response:

The battle between Antigonus and Eumenes in 317 BC was notable for being the first recorded instance of armies using war elephants in combat. These massive beasts were a formidable force on the battlefield, and their use marked a significant shift in ancient warfare.

In elephant versus elephant combat, a pair of beasts would use their heads and trunks to wrestle and push against each other. The struggle would continue until one elephant turned to flee, signaling defeat. The victor would then stick its tusks up into the stomach or between the rear legs of its opponent, seriously wounding or killing the beast.

This brutal method of elephant combat had not only physical but also social effects on the elephant corps. For example, when the lead elephant in Eumenes’ corps was gored and fell down dead, the remainder of the elephants were deprived of their leader and fled in confusion. This disruption in the ranks of the elephant corps likely had a significant impact on the outcome of the battle.

Overall, the use of war elephants in ancient battles was a formidable tactic, but it came with its own set of challenges and complications. The social dynamics within the elephant corps, as well as the physical demands of elephant versus elephant combat, played a significant role in the outcome of battles like the one between Antigonus and Eumenes.

That’s a quite reasonable writeup that incorporates citable facts. You have to be careful, though. In the first prompt I tried, I mistakenly referenced a battle at the ‘Camel’s Rampart’ in May of 321 BC, which was actually between Perdiccas and Ptolemy. The GPT didn’t care and generated wrapping text that fit the request:

The use of war elephants in ancient battles was a formidable tactic employed by many civilizations, and the battle of the ‘Camel’s Rampart’ in May of 321BC was no exception. Eumenes of Cardia, a Greek general, utilized a corps of elephants in his army, and they proved to be a decisive factor in the battle against his opponent, Antigonus.

Note that it got the generals right regardless. But it includes my mistaken information as well.

In short, the ChatGPT is great for expanding requests into readable text. The requests can be vague, like “Write a short story about a Moray Eel named ‘Moe’”, or highly specific, like the prompt above. The GPT takes that seed and creates text that most of us find comfortably readable as opposed to dense or strange. You can apply “styles” as well, as people are discovering when they ask it to write a poem about a topic. But its default behavior is to produce highly readable text.

This bias towards readable text creates a conflict where the machine will confidently lie. Because of the statistical underpinnings of how these models choose the next token, there is always a possibility that it will randomly choose to go in a direction that is not in the domain of the original prompt, but is nearby in the “information space” that is stored in the billions of weights that make up these large language models. It’s easy to show this with a simpler prompt:

22, 23, 24,

We would expect the number sequence to continue — “25, 26, 27”. And the GPT does that. But then something interesting happens. Here is the GPT’s output (highlighted):

As we can see, it continues with the number string for a while. But because this trajectory appears to be in a space that is associated with C++ programming, the GPT selects a “]” at some point, which changes the trajectory. A “]” means the end of an array definition, which leads to a semicolon, a new line, and some more array definitions, then some code that selects even numbers.

The trajectory, when you follow it, makes sense, but the behavior is not in the domain of the request. Like all deep learning systems, the GPT has attractors that tend to pull it in particular directions. This can be biases, such as making a nurse in a story a woman and the doctor a man, or it can be that numbers equal code.
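How a single unlikely token can redirect everything that follows can be sketched with a toy sampler. This is only an illustration: the candidate tokens and their scores are made up, and a real model scores a vocabulary of many thousands of tokens using its weights.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(candidates, scores, rng):
    """Pick one candidate at random, weighted by its probability."""
    return rng.choices(candidates, weights=softmax(scores), k=1)[0]

# Hypothetical scores for what might follow a run of numbers such as "..., 30, 31,"
candidates = [" 32", "]", "\n"]
scores = [4.0, 1.0, 0.5]

rng = random.Random(0)
probs = softmax(scores)
draws = [sample_token(candidates, scores, rng) for _ in range(1000)]

print({c: round(p, 3) for c, p in zip(candidates, probs)})
print("times ']' was chosen in 1000 draws:", draws.count("]"))
```

The next number is by far the most likely choice at each step, but the “]” is never impossible. Over a long enough output it will eventually be drawn, and every token after that is conditioned on it.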

We as humans can understand these larger-scale contextual relationships, and steer the model. For example we can ask the GPT for a male nurse and a female doctor. Sometimes though, a request cannot produce the desired result. If you prompt an image generator with the request for “a man riding a horse”, it will easily comply, but I have yet to produce anything approximating “a horse riding a man.” Below are some typical results from Stability.ai:

This is a hard problem, one that search engines struggle with as well. Given the query of “horse riding a man”, Bing and DuckDuckGo both fail. Google succeeds though. Among all the pictures of men on horses, I found this in the top results:

Google’s algorithm is still better at search in ways that we don’t often get to appreciate.

AI systems are fantastic at remixing content that exists in their domains. They can’t go outside of them. And within that domain, they may not create what you expect or want. This is fundamental to their design.

The things that humans can do that these machines will struggle with are originality, where people invent new things, social information processing, where the group is able to bring many diverse perspectives to solving problems (including fact-checking the output of these machines!), and large-scale contextual thinking, the kind it takes to put together something like a book, which ties together many different threads into a coherent whole that becomes clear at the end (source).

Despite the differences between collaborating with AI and collaborating with people, there are also some significant similarities. Large language models like the GPT are mechanisms that store enormous amounts of written information, which can be accessed using fundamentally social techniques such as, well, chat. The GPT can be given prompts and asked to generate responses based on certain criteria, just as a person might be asked to contribute to a group discussion or brainstorming session.

This is important because the process of creation rarely happens in isolation, and the ability to draw on a wide range of knowledge and experience is often crucial to producing faster and better results. Just as people can draw on the collective knowledge and expertise of a group to generate new ideas and insights, AI can draw on the vast store of information that it has been trained on to offer suggestions and alternatives.

Woody Allen once said that “80% of success is showing up.” The GPT always shows up. It is always available to work through concepts, to bounce ideas off, to help refine and expand upon them, and to work collaboratively in the creative process. This can be invaluable for creators looking for new ways to approach a task or solve a problem. Collaborative AI has the potential to revolutionize how we create and innovate. It can offer a wealth of knowledge, experience and perspective that would otherwise be difficult to access, and can help us achieve results faster than ever before.

At the same time, it can confidently create fictions that are close enough to reality that we are biased to accept them unquestioningly. So why should we be worried about this?

The main concern is that by using AI as a collaborator, we might be led in directions that may seem reasoned or well thought out, but are actually artifacts of large amounts of text written about a subject. Conspiracy theories are a great example of this. Get the GPT into the right space and it will generate text that takes the existence of Reptilians wearing human disguise as settled fact. We are much more likely to fully accept the output of AI as factual, especially if it contains familiar or plausible concepts and phrasing that we have interactively created with it.

In conclusion, it is possible to collaborate with AI in the same way as we would with another person. However, there are some key differences that must be taken into account. AI models are trained on vast amounts of text and data that may not always be accurate or up-to-date. Taking the output of these models at face value requires much more emphasis on critical thinking and checking sources than it does with human collaborators.

A simple example of ensemble training

I’ve been using multilayer Perceptrons (MLPs) for some quickly trainable sequence-to-sequence time series predictions. The goal is to take sensor data from one day and use that as training data to predict the next day’s patterns. The application is extremely consistent, but the hardware slowly degrades. By retraining, the error detection system is able to “drift” with the system as various parts wear at different rates. And there are a lot of sensors – several thousand per system, so rapid training is a nice feature.

The problem that I was running into had to do with hyperparameter tuning. I would make a change or two, and then re-run the system on my well-characterized simulated data, and the accuracy of the result would change in odd ways. It was very frustrating.

As a way to work through more options in an automated way, I built an optimizer class using evolutionary algorithms (adjusting variables, rather than evolutionary programming, which evolves code). I could then fire up the evolver and try hundreds or thousands of models as the system worked to find the best fitness (in this case highest accuracy).

But there was a big problem, which I kind of knew about. The random initialization of weights makes a HUGE difference in the performance of the model. I discovered this while looking at the results of the evolver, which saves the best of each generation and saves them out to a spreadsheet:

If you look at row 8, you see a lovely fitness of 0.9, or 90%. Which was the best value from the evolver runs. However, after sorting on the parameters so that they were grouped, it became obvious that there is a HUGE variance in the results. The lowest fitness is 30%, and the average fitness for those values is actually 60%. I tried running the parameters on multiple trained models and got similar results. These values are all over the place. Not good.

To address this, I need to be able to run a population and get the distribution stats (mean, 5% and 95% confidence, min, and max outliers). I can then sort on the mean, but also have insight into the variance. A good mean with wide variance may be worse than a slightly worse mean with tight variance.

So I added statistical tests to the evolver, based on this post, starting with the scikit-learn resample(). Here’s the important bits:

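from sklearn.utils import resample
import pandas as pd
import scipy.stats as st

def calc_fitness_stats(self, resample_size:int = 100):
    # bootstrap: draw resample_size values from self.population, with replacement
    boot = resample(self.population, replace=True, n_samples=resample_size, random_state=1)
    s = pd.Series(boot)
    # 95% confidence interval for the mean, using Student's t distribution
    conf = st.t.interval(0.95, len(boot)-1, loc=s.mean(), scale= st.sem(boot))
    self.meta_info = {'mean':s.mean(), '5_conf':conf[0], '95_conf':conf[1], 'max':s.max(), 'min':s.min()}
    self.fitness = s.mean()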
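To see why the spread matters as much as the mean, here is a small standalone sketch. It is not the evolver code: it swaps in the standard library’s random.choices for the scikit-learn resample(), leaves out the t interval, and the two lists of accuracies are made up for illustration.

```python
import random
import statistics

def fitness_stats(fitnesses, resample_size=100, seed=1):
    """Bootstrap a list of fitness values and summarize the resample."""
    rng = random.Random(seed)
    # sample with replacement, then sort so min and max are at the ends
    boot = sorted(rng.choices(fitnesses, k=resample_size))
    return {
        "mean": statistics.mean(boot),
        "stdev": statistics.stdev(boot),
        "min": boot[0],
        "max": boot[-1],
    }

# Made-up accuracies for two hypothetical hyperparameter sets, ten runs each
wide = [0.90, 0.30, 0.75, 0.45, 0.85, 0.35, 0.80, 0.40, 0.70, 0.50]
tight = [0.58, 0.60, 0.57, 0.61, 0.59, 0.60, 0.58, 0.62, 0.59, 0.60]

wide_stats = fitness_stats(wide)
tight_stats = fitness_stats(tight)
for name, stats in (("wide", wide_stats), ("tight", tight_stats)):
    print(name, stats)
```

Both lists have nearly the same average, but only the second one gives you a model you can count on getting again the next time you train.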

To evaluate, I used my test landscape, a 3D surface based on the equation z = cos(x) + sin(y) + (x + y)/10, over the range (-5, 5). I also added some randomness to the x and y values to noise up the results so the statistics would show something. This worked well on my landscape as you can see below, so I integrated it into my hyperparameter tuner.
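For reference, the test landscape itself can be written down in a few lines. This is a minimal sketch: the equation comes from the text, but the uniform jitter and its size are assumptions, since the post does not say how the noise was generated.

```python
import math
import random

def landscape(x, y):
    """Test surface: z = cos(x) + sin(y) + (x + y) / 10."""
    return math.cos(x) + math.sin(y) + (x + y) / 10.0

def noisy_fitness(x, y, rng, noise=0.5):
    """Evaluate the surface at a randomly jittered (x, y).

    The uniform jitter and its default size are assumed for illustration.
    """
    jx = x + rng.uniform(-noise, noise)
    jy = y + rng.uniform(-noise, noise)
    return landscape(jx, jy)

rng = random.Random(1)
samples = [noisy_fitness(0.0, 1.5, rng) for _ in range(10)]
print("clean value:", round(landscape(0.0, 1.5), 3))
print("noisy range:", round(min(samples), 3), "to", round(max(samples), 3))
```

Because the same (x, y) now returns a different z on every call, a single evaluation says little, which is what makes the population statistics worth computing.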

Before I go into the results, let me describe the whole data set – what it looks like in total, what the parts are that we are trying to recognize, and the ground truth that we are training against:

Full Data Set: The data is a set of mathematical functions. In this case, it’s a simple set of ten sin(x) waves of varying frequency. They all start at the same value, and evolve from there. The shortest wavelength is cyan, the longest is dark blue in the figure below. It’s a reasonable proxy for ten sensors that change over the course of a day, some quickly, some slowly:

Full_data

Training Set: I take the above dataset, which has 200 elements and split it in two. This creates a training set or input vector of 100 elements and an output, “ground truth” vector that the system will be trained to recognize. So ten shapes will be trained to map to ten other shapes in one MLP network:

Clean_input
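A data set with this shape can be sketched in NumPy as follows. The frequencies and the x range are assumed values, picked only so that the ten waves look different from each other; the 200-element length and the 100/100 split come from the description above.

```python
import numpy as np

num_samples = 200   # length of each full wave
num_waves = 10      # one wave per simulated sensor

# Assumed values, for illustration only
x = np.linspace(0.0, 4.0 * np.pi, num_samples)
freqs = np.linspace(0.2, 2.0, num_waves)

# Rows are time steps, columns are waves; every wave starts at sin(0) = 0
full_mat = np.sin(np.outer(x, freqs))

# First half is the input, second half is the ground truth to predict
train_mat = full_mat[:100]
truth_mat = full_mat[100:]

print(full_mat.shape, train_mat.shape, truth_mat.shape)
```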

Ground Truth: These are the 100-sample vectors that we will be training the network to produce:

All Predictions: If you take the first random result of the evolver, you will get ten models that are identical except for the initial weights. In this case, the hyperparameters are number of layers, neurons per layer, batch size and epochs. The evolver initially comes up with a population of ten random genomes (in specified ranges, like 10 – 1000 neurons, with a step of 10). It then keeps the five best “genomes” and breeds and mutates 5 more. New genomes are in turn run 10 times to produce the statistics. The models associated with the best values are saved.
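The keep-five, breed-five loop described above can be sketched with the standard library. This is not the evolver class itself: the parameter ranges other than the neuron count, the mutation rate, and the toy fitness function (a stand-in for actually training a model) are all hypothetical. For simplicity it also re-scores every genome each generation, not just the new ones.

```python
import random
import statistics

# Hypothetical search space: (low, high, step) for each hyperparameter
RANGES = {"layers": (1, 10, 1), "neurons": (10, 1000, 10),
          "batch_size": (1, 20, 1), "epochs": (10, 100, 10)}

def random_genome(rng):
    return {k: rng.randrange(lo, hi + 1, step) for k, (lo, hi, step) in RANGES.items()}

def breed(a, b, rng, mutation_rate=0.2):
    """Take each gene from one parent, then occasionally re-roll it."""
    child = {k: rng.choice((a[k], b[k])) for k in RANGES}
    for k, (lo, hi, step) in RANGES.items():
        if rng.random() < mutation_rate:
            child[k] = rng.randrange(lo, hi + 1, step)
    return child

def toy_fitness(genome, rng):
    """Stand-in for training a model: a noisy score that favors ~500 neurons."""
    return 1.0 - abs(genome["neurons"] - 500) / 1000 + rng.gauss(0.0, 0.05)

def mean_fitness(genome, rng, runs=10):
    """Evaluate a genome several times so one lucky run cannot dominate."""
    return statistics.mean(toy_fitness(genome, rng) for _ in range(runs))

rng = random.Random(0)
population = [random_genome(rng) for _ in range(10)]
for generation in range(20):
    ranked = sorted(population, key=lambda g: mean_fitness(g, rng), reverse=True)
    survivors = ranked[:5]                      # keep the five best genomes
    children = [breed(*rng.sample(survivors, 2), rng) for _ in range(5)]
    population = survivors + children           # back to ten genomes

best = max(population, key=lambda g: mean_fitness(g, rng))
print(best)
```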

If we look at one of the initial models, before any evolutionary optimization, you can see why this approach is needed. Remember, this variation is based solely on the different random initialization of the weights between layers. What you are looking at is the input vector being run through ten models that are used to calculate the statistical values of the ensemble. You can see that most values are pretty good, some are a bit off, and some are pretty bonkers.

Ensemble Average: On the whole though, if you take the average of all the ensemble, you get a pretty nice result. And, unlike the single-shot method of training, the likelihood that another ensemble produced with the same architecture will be the same is much higher.

Here’s the code to take the average:

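import os
import numpy as np
import tensorflow as tf

avg_mat = np.zeros(self.test_mat.shape)
with os.scandir() as entries:
    count = 0  # number of models summed so far
    for entry in entries:
        if entry.is_file() or entry.is_symlink():
            # caution: this deletes every loose file in the working directory
            os.remove(entry.path)
        elif entry.is_dir():
            # each saved model lives in its own directory
            count += 1
            print("loading: {}".format(entry.name))
            new_model = tf.keras.models.load_model(entry.name)
            self.predict_mat = new_model.predict(self.train_mat)
            avg_mat = np.add(self.predict_mat, avg_mat)
# divide by the number of models actually loaded (starting count at 1 would divide by one too many)
avg_mat = avg_mat / max(count, 1)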

 

This is not to say that the model is perfect. The orange curve at the top of the last chart is too low. This model had a mean accuracy of 67%. But this is roughly equivalent to my initial hyperparameter guesses. Let’s see what happens after 50 generations.

Five hours and 5,000 evaluations later, I have the full run of 50 generations. Things did get better. We end with a higher mean, but we also have a variance that does not steadily improve. This means that it’s possible that the architecture around generation 23 might actually be better:

Because all the values are saved in the spreadsheet, I can try those hyperparameters, but the system as I’ve written it only saves the “best” set of parameters. Let’s see what that best ensemble looks like as an ensemble when compared to the early run:

That is a lot better. All the related predictions are much closer to each other, and appear to be clustered around the right places. I am genuinely surprised how tidy the clustering is, based on the previous “All Predictions” plot towards the top of this post. On to the ensemble average:

That is extremely close to the “Ground Truth” chart. The orange line is in the right place, for example. The only error that I can see with a cursory visual inspection is that the height of the olive line is a little lower than it should be.

Now, I am concerned that there may be two peaks in this fitness landscape that we’re trying to climb. The one that we are looking for is a generalized model that can fit approximate curves. The other case is that the network has simply memorized the curves and will blow up when it sees something different. Let’s test that.

First, let’s revisit the training set. This model was trained with extremely clean data. The input is a sin function with varying frequencies, and the evaluation data is the same sin function, picking up where we cut off the training data. Here’s the clean data that was used to train the model:

Now let’s try noising that up, so that the model has to figure out what to do based on data that the model has never seen before:

Let’s see what happened! First, let’s look at all the predictions from the ensemble:

The first thing that I notice is that it didn’t blow up. Although the paths from each model are somewhat different, each one got all the paths approximately right, and there is no wild deviation. The worst behavior (as usual?) is the orange band, and possibly the green band. But this looks like it should average well. Let’s take a look:

That seems pretty good. And the orange / green lines are in the right place. It’s the blue, olive, and grey lines that are a little low. Still, pretty happy with this.

So, ensembles seem to work very well, and make for resilient, predictable behavior in NN architectures. The cost is that there is much more time required to run many, many models through the system to determine which ensemble is right.

But if you want reproducible results, it’s a good way to go.

An almost-from-scratch Python example of a simple neural network

Introduction

A rite of passage in understanding machine learning is writing your own network from scratch. This isn’t usually about making a better framework, it’s about figuring out what’s going on in all those frameworks. What follows is my contribution to this small but growing genre of programming literature.

The code is based on Andrew Trask’s Grokking Deep Learning (Github), which I’ll refer to as GDL below. No, I haven’t finished it, but this is the first major milestone, so I’m documenting it before I forget.

My background is that of a developer. I have been programming for a living since the 1980’s, mostly across the Object-Oriented landscape. I like classes and generalized solutions. It’s how I think, so this will be the framing that I use for this example.

There are two files: SimpleLayer.py, a class that handles the particulars of what a layer in a network needs to do, and simple_nn.py, a file that exercises that class by building a three-layer NN. We’ll walk through simple_nn.py first, which sets up and runs the network. Then we’ll walk through SimpleLayer, which handles training and backpropagation. At the bottom of the post are full code listings, which are also available on GitHub if you want to use this as a basis for experimentation.

simple_nn.py

Let’s start at the beginning:

import numpy as np
import matplotlib.pyplot as plt
import src.SimpleLayer as sl

One of the things that I try to do in these sorts of exercises is to keep the number of libraries to a minimum. For this I use two very vanilla imports, NumPy for math, and Matplotlib for diagrams of the weights changing over time.

Next, some global variables:

# variables ------------------------------------------
# The samples. Columns are the things we're sampling, rows are the samples
streetlights_array = np.array([[1, 0, 1],
                               [0, 1, 1],
                               [0, 0, 1],
                               [1, 1, 1]])
num_streetlights = len(streetlights_array[0])
num_samples = len(streetlights_array)

# The data set we want to map to. Each entry in the array matches the corresponding streetlights_array row
walk_vs_stop_array = np.array([[1],
                               [1],
                               [0],
                               [0]])

Here is all the data about the lights. Each row is a sample. Each element in the row is a light. There are four rows in this set. They are matched to a classification of each row in the walk/stop array. These values come from GDL, where the premise is that you have a set of samples from three lights (streetlights_array), and a set of samples of actions that happen (walk_vs_stop_array).

lights

Figure 1: The lights and the behaviors from GDL

The goal is to train a network from the input streetlights that produces the right walk/stop output. Now in a real network, we’d worry about overfitting and other related issues, but we’re going to ignore that here.

The next two variables are not in GDL. These are layer_array, which will contain the instances of the SimpleLayer class, and error_plot_mat, which will be used by pyplot to draw a chart of the error converging to zero. Or failing to, as the case may be.

# set up the list that will store the SimpleLayer instances
layer_array = []

error_plot_mat = [] # for drawing plots

There is one last bit of setup before we start doing things. There are three methods that will be used later in the program:

# Methods ---------------------------------------------
# activation function: sets all negative numbers to zero
# Otherwise returns x
def relu(x: np.array) -> np.array :
    return (x > 0) * x

# This is the derivative of the above relu function, since the derivative of 1x is 1
def relu2deriv(output: np.array) -> np.array:
    return 1.0*(output > 0) # returns 1 for input > 0
    # return 0 otherwise

# create a layer
def create_layer(layer_name: str, neuron_count: int, target: sl.SimpleLayer = None) -> 'SimpleLayer':
    layer = sl.SimpleLayer(layer_name, neuron_count, relu, relu2deriv, target)
    layer_array.append(layer)
    return layer

Let’s go through these one at a time. First, I’d like to say that as someone who likes compiling, strong typing, and all those components that keep me from doing dumb things, I am a fan of Python’s accommodation of typing. It could be better, but it helps:

  • def relu(x: np.array) -> np.array : 
    • This is an example of an activation function. It will be used in the layer to determine whether or not a value propagates through the neuron. In this case, all it does is clamp negative values to zero.
  • def relu2deriv(output: np.array) -> np.array:
    • This is the derivative of the above function, and returns a one (the slope of the line in relu()) if the value is greater than zero, and a zero otherwise.
  • def create_layer(layer_name: str, neuron_count: int, target: sl.SimpleLayer = None) -> 'SimpleLayer':
    • This is what I use to create a layer. You pass in the name for your layer, how many neurons it has, and its ‘target’, or the layer below it. One of the things that I discovered when writing the SimpleLayer class is how intimately layers are connected. In this case, building any other layer than the last requires a target layer. This allows the weights that will manage the influence between the neurons in each layer to be set up properly. It could have just as easily been built from top to bottom, and pointed at the ‘source’ layer.
    • The other thing that this method does is to store the newly created layer in the layer_array, which makes experimenting with adding and deleting layers trivial.
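A quick way to get a feel for the two activation helpers is to run them on a handful of values. The functions below have the same bodies as the ones above; the test values are arbitrary, and the type hints use np.ndarray, which is the array type itself.

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    return (x > 0) * x          # negative entries become zero

def relu2deriv(output: np.ndarray) -> np.ndarray:
    return 1.0 * (output > 0)   # slope is 1 where the value is positive, else 0

values = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
activated = relu(values)
slopes = relu2deriv(values)

print(activated)
print(slopes)
```

Only the last two entries pass through relu() unchanged, and those are the only two positions where relu2deriv() returns a one. During learning, that means an error signal can only flow back through neurons that were active on the forward pass.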

Ok, let’s set up the layers in our network! Again, this is a reimplementation of the network built in GDL (chapter 6):

np.random.seed(0)

#set up the layers from last to first, so that there is a target layer
output = create_layer("output", 1)
output.set_neurons([1])

hidden = create_layer("hidden", 4, output)
hidden.set_neurons([1, 2, 3, 4])

input = create_layer("input", 3, hidden)
input.set_neurons([1, 2, 3])

for layer in reversed(layer_array):
    print ("--------------")
    print (layer.to_string())

First, we seed the random generator with a value so that these results are repeatable. It turns out that this network can be made to converge very fast or not to converge simply by picking a different seed. We’ll discuss what this implies later.

So what we’ve created is a stack of layers, built from bottom to top that looks like this:

  • Input layer (3 neurons): This is where the streetlights information will be loaded
  • Hidden layer (4 neurons): This layer mediates the interactions between the input and output layers, making these interactions nonlinear. It’s what allows deep neural networks to learn nonlinear, discontinuous functions from examples. (And remember, that’s all that neural networks do. Though to be fair, it may be all our brains do, too…)
  • Output layer (1 neuron): This is where the walk/stop values will be used to adjust the output that is generated starting with the original, random weights.

We then print out the contents of each layer. One quick note – I load the neurons up with sequential integers (e.g. [[1. 2. 3.]]). These values get overridden when the system is run, so it’s just a way to quickly verify the as-built neurons:

--------------
layer input: 
target = hidden
source = no source
neurons (row) = [[1. 2. 3.]]
weights (row) = 
[[-0.1526904   0.29178823 -0.12482558  0.783546  ]
 [ 0.92732552 -0.23311696  0.58345008  0.05778984]
 [ 0.13608912  0.85119328 -0.85792788 -0.8257414 ]]
--------------
layer hidden: 
target = output
source = input
neurons (row) = [[1. 2. 3. 4.]]
weights (row) = 
[[0.09762701]
 [0.43037873]
 [0.20552675]
 [0.08976637]]
--------------
layer output: 
target = no target
source = hidden
neurons (row) = [[1.]]
weights (row) = 
None

I think this output is pretty obvious, aside from the weights, so let’s look at them more closely. But first, a short digression.

When I see diagrams and descriptions of connected layers of neurons, I usually see something like this:

FullyConnected

Figure 2: The typical neural network diagram

As you can see, each neuron in the input layer is connected to each neuron in the hidden layer and so on through to the output layer. And that’s nice conceptually, but as a developer, I have no understanding of the mechanics of what’s happening. Here’s what it really looks like. For clarity, only the interactions involving input neuron 1 and hidden neuron 4 are shown, but the process is identical:

FullyConnectedWeights

Figure 3: How the input and hidden layer are actually connected

In this case, we’re looking at the mapping between the input layer and the hidden layer. Each neuron in the input layer gets its own row of weights, let’s say [0.1, 0.2, 0.0, 0.5] for neuron one. If that neuron is set to “10”, then a value of 1.0 will go to hidden neuron 1, a value of 2 to hidden neuron 2, and a value of 5 to hidden neuron 4. This process is repeated for each neuron in the input layer, and the value is added to the associated hidden neuron.

That’s what we mean when we talk about fully connected layers. Everything is mediated through an adjacency matrix of weights. We’ll revisit this in more detail when we walk through SimpleLayer. So in the listing above the two figures, the randomly initialized weights are organized so that each neuron in the layer has its own row. Each entry in that row is the scalar that the row’s neuron value will be multiplied by as it is accumulated in the target’s neuron. Source neurons are the row component. Target neurons are the column component.
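The accumulation described above is a vector-matrix product, which can be checked directly in NumPy. This is a sketch of the arithmetic only, on the assumption that SimpleLayer does something equivalent. The first row of weights is the example from the text; the other two rows are made up.

```python
import numpy as np

# One row of weights per input neuron, one column per hidden neuron
weights = np.array([[0.1, 0.2, 0.0, 0.5],
                    [0.3, 0.0, 0.1, 0.0],
                    [0.0, 0.4, 0.2, 0.1]])

# Only input neuron 1 is active, set to 10 as in the text
input_neurons = np.array([[10.0, 0.0, 0.0]])

# Each hidden neuron accumulates (input value * weight) over all input neurons
hidden_neurons = input_neurons @ weights
print(hidden_neurons)

# The same accumulation written out as explicit loops
by_hand = np.zeros((1, 4))
for src in range(3):
    for tgt in range(4):
        by_hand[0, tgt] += input_neurons[0, src] * weights[src, tgt]
print(by_hand)
```

With only the first input neuron active, the hidden layer receives 1, 2, 0 and 5, matching the worked example. Setting the other two inputs to nonzero values simply adds their rows’ contributions on top.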

Next is the body of the program:

alpha = 0.2
iter = 0
max_iter = 1000
epsilon = 0.001
error = 2 * epsilon
while error > epsilon:
    error = 0
    for sample_index in range(num_samples):
        input.set_neurons(streetlights_array[sample_index])
        for layer in reversed(layer_array):
            layer.train()

        delta = output.calc_delta(walk_vs_stop_array[sample_index])
        sample_error = np.sum(delta ** 2)
        error += sample_error

        for layer in layer_array:
            layer.learn(alpha)

        # Gather data for the plots
        error_plot_mat.append([sample_error])
        # print("{}.{} Error = {:.5f}".format(iter, sample_index, sample_error))

    error /= num_samples
    if (iter % 10) == 0 :
        print("{} Error = {:.5f}".format(iter, error))
    iter += 1
    # stop even if we don't converge
    if iter > max_iter:
        break

Let’s talk about the local variables first. The first variable, alpha, is the learning rate that we pass in. It’s a scalar that limits the step size in the change in weights. The bigger the scalar, the more likely to overshoot the goal and go into oscillation around it. The smaller the scalar, the longer the approach will take, but the greater the chance that it will stabilize. Like the seed we use to set the random number generator, the number of layers, and the number of neurons per layer, this is a hyperparameter. Some, like alpha, result in predictable behavior. Others, like seed, do not. There is a lot of this in deep learning, and you need to be careful about it. In particular, testing the resiliency of the solution by running it with a variety of nonlinear hyperparameters to see if the results are consistent is probably a good idea, though it sucks up compute resources.
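The effect of alpha is easiest to see on a problem much simpler than the network. The sketch below runs plain gradient descent on f(w) = w², a toy function chosen for illustration and not part of the streetlight example. The three alpha values are arbitrary.

```python
def descend(alpha, steps=20, start=1.0):
    """Minimize f(w) = w**2 with gradient descent; the gradient is 2w."""
    w = start
    path = [w]
    for _ in range(steps):
        w -= alpha * 2.0 * w
        path.append(w)
    return path

small = descend(alpha=0.1)    # steady approach to zero
large = descend(alpha=0.9)    # overshoots and flips sign on every step
too_big = descend(alpha=1.1)  # each overshoot is larger than the last

for name, path in (("0.1", small), ("0.9", large), ("1.1", too_big)):
    print("alpha =", name, "final |w| =", abs(path[-1]))
```

The small step creeps toward the minimum from one side, the large step bounces back and forth across it while still closing in, and the oversized step bounces further away each time and never converges.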

The rest of the variables are used for loop control:

  • iter is the current count of times through the loop
  • max_iter is the maximum times we’ll run through the loop, even if we don’t converge
  • epsilon is the error threshold. If the error drops below that, we’re done.
  • error is the average, over all the samples, of the sum of the squared deltas of the output neurons (in this case, one). We initialize it to a value that gets us into the loop

while error > epsilon:
    error = 0
    sample_error_array = []
    for sample_index in range(num_samples):
        input.set_neurons(streetlights_array[sample_index])

This is the main loop. First, we’re going to loop until our error is small. Error is computed by sample, so we need to know what the average (or max – I use average here) error is for each iteration. We also want to save the individual errors by sample for later plotting.

Within each loop, we’re going to evaluate the input streetlights sample against the output walk/stop sample. The first step in this process is to set the input neurons. This is where the input.set_neurons([1, 2, 3]) that we did when we were creating the layers gets overridden. In the training, the output from this layer will overwrite the values in the next layer and so on.

for layer in reversed(layer_array):
    layer.train()

This is the training step. We’ll go into more detail when we walk through SimpleLayer, but for now note that we step through all the layers from the top to the bottom, in reversed order from how they were created and loaded into layer_array.

delta = output.calc_delta(walk_vs_stop_array[sample_index])
sample_error = np.sum(delta ** 2)
error += sample_error

This is where we calculate the array of deltas that are the difference between the goal of the walk/stop array and the output neurons. The error is the sum of the squares of all those deltas. Sum of squares is nice because it’s always positive.
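
A quick sketch of why squaring matters, using invented deltas for a hypothetical two-neuron output layer (the streetlight network only has one output neuron, so this case never comes up there):

```python
import numpy as np

# Hypothetical deltas for a layer with two output neurons:
# one overshoots the goal, the other undershoots by the same amount.
delta = np.array([[0.5, -0.5]])

plain_sum = np.sum(delta)            # the two misses cancel out
sum_of_squares = np.sum(delta ** 2)  # each miss contributes a positive amount

print(plain_sum, sum_of_squares)
```

With a plain sum, this layer would look perfect even though both neurons are wrong.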

for layer in layer_array:
    layer.learn(alpha)

# Gather data for the plots
sample_error_array.append(sample_error)
# print("{}.{} Error = {:.5f}".format(iter, sample_index, sample_error))

Learning is done from bottom to top, using the deltas stored in the output layer. These are backpropagated through the layers, and the changes in the weights are scaled to 20% of the calculated values so we settle nicely.

We also gather the error data (for each streetlight-walk/stop sample) into a matrix that we can print out when we’re done. If we want to, we can print the error for each sample in the training. Some converge faster than others, but this is not the best way to see that.

error /= num_samples
if (iter % 10) == 0 :
    print("{} Error = {:.5f}".format(iter, error))
iter += 1
# stop even if we don't converge
if iter > max_iter:
    break

At the bottom of the loop, we calculate the average error over all the samples. We then see if we’ve been here too long and break if we have, regardless of whether we’ve converged or not. And lastly, this is how I like to print formatted strings in Python (essentially the same as “%.5f” in Java/C/etc).

Once the loop terminates, we need to see how well the network has learned. As I said earlier, in a real machine learning situation we would be careful about issues such as overfitting by, for example, training against one set of data and testing against another. But since this is a toy problem, we are simply going to see how it did with the training data. I’ve added some explicit variables for clarity:

  • prediction: The contents of the single neuron in the output layer
  • observed: The value in the walk/stop array that we’re evaluating against
  • accuracy: how close did we get?

print("\n--------------evaluation")
for sample_index in range(len(streetlights_array)):
    input.set_neurons(streetlights_array[sample_index])
    for layer in reversed(layer_array):
        layer.train()
    prediction = float(output.neuron_row_array)
    observed = float(walk_vs_stop_array[sample_index])
    accuracy = 1.0 - abs(prediction - observed)
    print("sample {} - input: {} = pred: {:.3f} vs. actual:{} ({:.2f}% accuracy)".
          format(sample_index, input.neuron_row_array, prediction, observed, accuracy*100.0))

Since the network is already set up with weights, all we need to do is to see how well our inputs match to our outputs. All this means is to take a set of inputs and run them forward through the model. There will be no learning via backpropagation.

So let’s see how we did!

0 Error = 0.35238
10 Error = 0.29001
20 Error = 0.19074
30 Error = 0.12883
40 Error = 0.04666
50 Error = 0.00544

--------------evaluation
sample 0 - input: [[1. 0. 1.]] = pred: 0.978 vs. actual:1.0 (97.78% accuracy)
sample 1 - input: [[0. 1. 1.]] = pred: 1.000 vs. actual:1.0 (100.00% accuracy)
sample 2 - input: [[0. 0. 1.]] = pred: 0.037 vs. actual:0.0 (96.27% accuracy)
sample 3 - input: [[1. 1. 1.]] = pred: 0.000 vs. actual:0.0 (99.95% accuracy)

As you can see, the values converge in fewer than 60 iterations, and the predictions are quite close. For the second and fourth streetlight patterns, the results are basically exact (100% and 99.95%). That’s not bad for a bunch of random numbers and two simple rules.

These are the kinds of outputs that you get with heavyweight packages like Keras. They’re helpful (We trained successfully! Hooray!). And these types of outputs make sense when models are huge – or even for bigger toy problems like MNIST (which we will explore in a future post).

But this is toy code for a toy problem so we can show more than that. Being able to visualize what’s going on is very helpful. That’s why the error for each step has been saved in error_plot_mat.

Plotting data like this in Python is one of the joys of using the language. Here’s what it takes:

# plots ----------------------------------------------
fig_num = 1
f1 = plt.figure(fig_num)
plt.plot(error_plot_mat)
plt.title("error")
names = []
for i in range(num_samples):
    names.append("sample_{}".format(i))
names.append("average")
plt.legend(names)

for layer in reversed(layer_array):
    if layer.target != None:
        fig_num += 1
        layer.plot_weight_matrix(fig_num)

for layer in reversed(layer_array):
    fig_num += 1
    layer.plot_neuron_matrix(fig_num)

plt.show()

We are going to be creating a bunch of plots. One for the error, and then one for each set of neurons and their weights. We’ll get back to the layer plots when we’re walking through SimpleLayer, but here’s a plot of all the errors, by sample and average for the entire training session:

Figure 4: Error for each sample

Some things are worth noting. First, this is not a linear process. There are times where the learning process is pretty slow, particularly at the beginning in this example. Second, zero error happens much sooner for some samples than others. The first sample with zero error happens around step 150 (iteration 37 or so of the main loop). If the exit condition were based on looking at one sample instead of the average of all the sample errors, the system could exit early. I had this happen when I was using sample_error rather than error in the exit condition. It took a while to figure out why some seed values behaved so differently from others….
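
Here is a small sketch of that early-exit trap. The error values are invented for illustration and are not taken from a real run; only epsilon matches the value used in simple_nn.py.

```python
epsilon = 0.001

# Hypothetical per-sample errors from one pass through the four samples.
# Two samples have already converged; two have not.
sample_errors = [0.0002, 0.31, 0.18, 0.0004]

average_error = sum(sample_errors) / len(sample_errors)

# Testing the most recent sample error would end training too soon.
stop_on_last_sample = sample_errors[-1] <= epsilon
# Testing the average keeps training until all samples are close.
stop_on_average = average_error <= epsilon

print(stop_on_last_sample, stop_on_average)
```

Whether the last sample happens to be one of the fast learners depends on the initial weights, which is one plausible reason the behavior would change with the seed.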

And that ends the tour of the main loop. Next, we’ll look at how layers interact to train and learn.

SimpleLayer

The previous section is roughly equivalent to working with Keras, Torch, or another machine learning framework. You get an idea of the behavior of a system and how the construction affects the output, but the details of the implementation are hidden. In this section, we’re going to look at the creation of a layer in detail – the ways layers are connected and the ways that they communicate. As with the walkthrough of the main loop, we’ll start with the construction of the layer, then the forward training process, then the backpropagation learning process, and then graph what’s going on.

Construction

As with simple_nn.py, SimpleLayer is written to have very few dependencies. I actually struggled with whether or not to write my own matrix math, but I think NumPy is pretty clear, and it would get distracting with all the additional code.

import numpy as np
import matplotlib.pyplot as plt
import types
import typing

There are some class-wide variables that we should describe:

class SimpleLayer:
    name = "unset"
    neuron_row_array = None
    neuron_col_array = None
    weight_row_mat = None
    weight_col_mat = None
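    weight_history_mat = [] # weights from each training pass, for drawing plots
    neuron_history_mat = [] # neuron values from each training pass, for drawing plots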
    num_neurons = 0
    delta = 0 # the 'movement' scalar
    target = None
    source = None
    activation_func = None
    derivative_func = None

In order of declaration, these are

  • name: the string name of the layer. Used in printing and surprisingly useful in debugging
  • neuron_row_array: the neurons in row form (i.e. [[n1, n2, n3, … , nN]])
  • neuron_col_array: the transpose of neuron_row_array (i.e. [[n1], [n2], [n3], … ,[nN]]). We need the data in both forms for interactions between layers
  • weight_row_mat: the weights in row format, as above
  • weight_col_mat: the weights in column format, as above
  • weight_history_mat: where the weight data from each training pass is stored for plotting
  • neuron_history_mat: where the neuron data from each training pass is stored for plotting
  • num_neurons: the number of neurons in this layer
  • delta: the error delta for this layer’s neurons. Set by calc_delta() in the output layer, then backpropagated to the other layers in learn(). The size of the “step” is controlled separately by alpha, which is passed to learn() from simple_nn.py
  • target: the layer “below” this layer. May be None
  • source: the layer “above” this layer. May be None
  • activation_func: the function that controls the nonlinearity of the training process. Passed in as relu() from simple_nn
  • derivative_func: the function used in backpropagation that is the derivative of the activation function. Passed in as relu2deriv() in simple_nn

Next is the initialization, which is done through the constructor:

# set up the layer with the number of neurons, the next layer in the sequence, and the activation/backprop functions
def __init__(self, name, num_neurons: int, activation_ptr: types.FunctionType, deriv_ptr: types.FunctionType, target: 'SimpleLayer' = None):
    self.reset()
    self.activation_func = activation_ptr
    self.derivative_func = deriv_ptr
    self.name = name
    self.num_neurons = num_neurons
    self.neuron_row_array = np.zeros((1, num_neurons))
    self.neuron_col_array = np.zeros((num_neurons, 1))
    for i in range(num_neurons):
        self.neuron_history_mat.append([])
    # We only have weights if there is another layer below us
    if(target != None):
        self.target = target
        target.source = self
        self.weight_row_mat = 2 * np.random.random((num_neurons, target.num_neurons)) - 1
        self.weight_col_mat = self.weight_row_mat.T

This takes the values supplied in the create_layer() method in simple_nn.py and builds the layer. Once the local variables are set, the matrices of neurons are created.

If there is a target, the two layers are connected. What this means is that the source layer creates a numpy matrix that has as many rows as the source neurons and as many columns as the target neurons (See figure 3). This matrix is the weights that are used to uniquely distribute the value of each neuron in the source layer to each neuron in the target layer. As with the neurons, this is stored in row and column form.
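
As a rough illustration of what that initialization produces, here is a standalone sketch. The variable names num_source, num_target and rng_seed are hypothetical stand-ins; the weight expression itself is the one from the constructor.

```python
import numpy as np

rng_seed = 0            # hypothetical seed, chosen only for repeatability
np.random.seed(rng_seed)

num_source, num_target = 3, 4

# np.random.random() draws from [0, 1); scaling by 2 and subtracting 1
# shifts that to [-1, 1), so weights start out both positive and negative.
weight_row_mat = 2 * np.random.random((num_source, num_target)) - 1
weight_col_mat = weight_row_mat.T

print(weight_row_mat.shape, weight_col_mat.shape)
print(weight_row_mat.min(), weight_row_mat.max())
```

Note that .T returns a view rather than a copy, so in this sketch the column form always reflects the current values of the row form.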

Once each layer is set up, we are ready to begin the training process.

Training

Training a neural network is the process of taking a set of input values and sending them through the entire network to get an output. We can compare that output to the desired value, and then adjust. Using the mechanism of a deep neural network allows us to build a system that can map many input values to a desired output value. In this case, we’re looking at three values in an array, but using exactly the same structure, we can increase the number of values to be the pixels in an image and the output to be the label for that image:

Figure 5: The CIFAR-10 Dataset

That takes more layers and some other tricks, but the basic technique is the same.

Ok, back to three values in an array that represent some streetlights. To get this into the input layer, we use the set_neurons() method:

# Fill neurons with values
def set_neurons(self, val_list: typing.List):
    # print("cur = {}, input = {}".format(self.neuron_array, val_list))
    for i in range(0, len(val_list)):
        self.neuron_row_array[0][i] = val_list[i]
    self.neuron_col_array = self.neuron_row_array.T

The numpy neuron arrays are actually two-dimensional arrays that are one element deep. This supports numpy array math like dot product and transpose. That’s why the awkward syntax where we take the val_list and set the neurons to those values. We then take the transpose immediately so that I don’t have to wonder if it’s been done already.
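
A short sketch of why the extra dimension is worth the awkwardness. The arrays here are standalone examples, not the layer’s own members:

```python
import numpy as np

flat = np.array([1.0, 0.0, 1.0])        # shape (3,)
row = np.zeros((1, 3))                  # shape (1, 3), like neuron_row_array
row[0][:] = flat                        # fill the single row in place

# Transposing a 1-D array does nothing; transposing the 2-D row gives a column.
print(flat.T.shape)
print(row.T.shape)

# With a column and a row we can form the kind of product learn() needs
# for its weight adjustments.
outer = np.dot(row.T, np.array([[0.5, -0.5]]))
print(outer.shape)
```

A flat array has no separate row and column form, so the transpose that learn() relies on would silently do nothing.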

The next step is to ripple the values through the network layers:

def train(self):
    # if not the bottom layer, we can record values for plotting
    if(self.target != None):
        self.weight_history_mat.append(self.nparray_to_list(self.weight_row_mat))

    # if we're not the top layer, propagate weights
    if self.source != None:
        src = self.source
        # set our neuron values as the dot product of the source neurons, and the source weights
        self.neuron_row_array = np.dot(src.neuron_row_array, src.weight_row_mat)

        # No activation function to output layer
        if(self.target != None):
            # Adjust the values based on the activation function. This introduces nonlinearity.
            # For example, the relu function clamps all negative values to zero
            self.neuron_row_array = self.activation_func(self.neuron_row_array)

        # Transpose the neuron array and save for learn()
        self.neuron_col_array = self.neuron_row_array.T

    # record values for plotting
    for i in range(self.num_neurons):
        self.neuron_history_mat[i].append(self.neuron_row_array[0][i])

We start to see how intimately the layers are connected in this method. We look to the target and source layers to adjust our behaviors and set values.

If this is the top layer, we have no source. That means that we record our weights for later plotting and we’re done. The layer below us will set its neurons based on this layer’s weights and neurons, as handled in this line:

self.neuron_row_array = np.dot(src.neuron_row_array, src.weight_row_mat)

This is just the first step. If we’re not the bottom layer, we have to see if the neuron values make it past the activation function that we set in simple_nn.py:

# activation function: sets all negative numbers to zero
# Otherwise returns x
def relu(x: np.array) -> np.array :
    return (x > 0) * x

This is done with these lines:

# No activation function to output layer
if(self.target != None):
    # Adjust the values based on the activation function. This introduces nonlinearity.
    # For example, the relu function clamps all negative values to zero
    self.neuron_row_array = self.activation_func(self.neuron_row_array)

By running these same methods on each successive layer object, the streetlight values are slowly and nonlinearly (which is critical in multi-layer networks) modified to produce a single output. Unfortunately, that output is guaranteed to be wrong, since it’s based on multiplying the input values by a bunch of random values that we set up each layer with.
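
As an aside on why the nonlinearity is critical: without an activation function, stacked layers collapse into a single matrix multiply. The tiny weights below are hand-picked for illustration and have nothing to do with the streetlight network.

```python
import numpy as np

def relu(x):
    return (x > 0) * x

x = np.array([[1.0, 0.0]])
w1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
w2 = np.array([[1.0],
               [1.0]])

# Without an activation function, two layers equal one layer with weights w1.w2
two_linear_layers = np.dot(np.dot(x, w1), w2)
one_merged_layer = np.dot(x, np.dot(w1, w2))

# With relu between the layers, the negative hidden value is blocked
# and the result can no longer be produced by the merged matrix.
with_relu = np.dot(relu(np.dot(x, w1)), w2)

print(two_linear_layers, one_merged_layer, with_relu)
```

The first two results are identical, so the extra linear layer bought nothing; only the relu version behaves differently.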

Time to fix that.

Learning

Back in simple_nn.py, between the train() and the learn() loops is this line:

delta = output.calc_delta(walk_vs_stop_array[sample_index])

The delta saves out the error for the plotting. The function sets up the values for the learning step:

def calc_delta(self, goal: np.array) -> np.array:
    self.delta = goal - self.neuron_row_array
    return self.delta

self.delta is a numpy array that stores the difference between the goal(s) and the current value. In this case, there is only one value, but this also works with multiple values. That’s another trick that gets used in training networks. For example, in handling the CIFAR images, there is an output neuron for each category (e.g. horse, automobile, truck, ship, etc.). In our toy example and in the CIFAR case, the goal is a one or zero in the output neuron(s). The delta is the difference between the computed value and the goal. That delta is what we will now backpropagate through the layers, from back to front. And that’s the learning process.
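
To show the multi-output case, here is a sketch with four invented output values. The category list is cut down from CIFAR’s ten for brevity, and the numbers are made up; only the subtraction mirrors calc_delta().

```python
import numpy as np

# Hypothetical four-category output layer; the labels are illustrative.
categories = ["horse", "automobile", "truck", "ship"]
goal = np.array([[0.0, 0.0, 1.0, 0.0]])               # the sample is a truck
neuron_row_array = np.array([[0.1, 0.2, 0.6, 0.3]])   # current output values

# Same arithmetic as calc_delta(): goal minus current output.
delta = goal - neuron_row_array
sample_error = np.sum(delta ** 2)

print(delta)
print(sample_error)
```

The delta is positive where the output needs to rise (the truck neuron) and negative where it needs to fall.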

In learning, the basic goal is to adjust the weights that set this layer’s neurons (in this implementation, those weights are stored in the source layer). This is done by backpropagating the error delta from this layer to the source layer. Since we only want to adjust the weights that participated in the training, we need to take the derivative of the activation function used in train(). Again, the weight matrix has one entry for each pairing of a source neuron with one of this layer’s neurons. For example, if the source layer had three neurons and this layer had four, then the (source) weight matrix would hold 3*4 = 12 weights. The whole method is shown below.

def learn(self, alpha):
    if self.source != None:
        src = self.source
        delta_scalar = np.dot(self.delta, src.weight_col_mat)
        delta_threshold = self.derivative_func(src.neuron_row_array)
        src.delta = delta_scalar * delta_threshold
        mat = np.dot(src.neuron_col_array, self.delta)
        src.weight_row_mat += alpha * mat
        src.weight_col_mat = src.weight_row_mat.T

There’s a lot going on here, so let’s go through it slowly:

def learn(self, alpha):
    # if there is a layer above us
    if self.source != None:
        src = self.source

Since weights exist between neurons, we once more have the intimate relationship between this layer’s neurons and the layer above this layer. If there is no layer above us, there is literally nothing to do, which is why this test is first.

delta_scalar = np.dot(self.delta, src.weight_col_mat)

Next, we calculate the error delta scalar array, which is the amount the source layer needs to change (set initially in the output layer’s calc_delta(), then rippled up through the layers), multiplied across the weights used to set this layer’s neurons (in the source).

delta_threshold = self.derivative_func(src.neuron_row_array)

In the train() process, we distributed the values in a non-linear way – any neuron value below zero was not distributed (this is the relu() function from simple_nn.py). That process needs to be mirrored in the backpropagation process. There is always a matched pair of methods that make up the core of a neural network – the activation function, and the derivative function.
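
The matched pair can be checked against each other. This sketch copies relu() and relu2deriv() from simple_nn.py and compares the derivative with a finite-difference slope; the test values and the step size h are my own choices.

```python
import numpy as np

def relu(x):
    return (x > 0) * x

def relu2deriv(output):
    return 1.0 * (output > 0)

values = np.array([[-2.0, -0.5, 0.5, 2.0]])

# Forward pass: negative values are blocked, positive values pass unchanged.
print(relu(values))
# Backward pass: the slope is 0 where the neuron was blocked and 1 where it passed.
print(relu2deriv(values))

# Check the slope numerically with a small central difference.
h = 1e-6
numeric_slope = (relu(values + h) - relu(values - h)) / (2 * h)
print(numeric_slope)
```

The check deliberately avoids zero itself, where relu has a kink and the slope is a matter of convention.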

src.delta = delta_scalar * delta_threshold

This is where the actual change for the source layer is calculated. It’s the elementwise product of the delta_scalar and the delta_threshold that we’ve just calculated. This is where the decision made by derivative_func() gates the backpropagated delta: source neurons that did not fire in train() get a delta of zero. (The scaling by alpha, which we pass in from simple_nn.py, happens when the weights are adjusted below.) This value will be used when the learn() method is called for the source layer. Like I said, layers are intimately connected.

We now take the self.delta that was calculated in our target layer’s learn() method, and use it to adjust the weights in the source layer that will be used to set our neuron’s values on the next train() pass.

mat = np.dot(src.neuron_col_array, self.delta)
src.weight_row_mat += alpha * mat
src.weight_col_mat = src.weight_row_mat.T

This matrix (mat) contains the adjustments for the source layer’s weights. We want to add only a fraction of these values (or we won’t converge), so we multiply by alpha. The last step is simply making the transpose of the weight matrix.
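
It can help to watch the shapes go by for one call to learn(). The sketch below pulls the same arithmetic out of the class and runs it on stand-in arrays for a 4-neuron hidden layer feeding a 1-neuron output layer. All the values are invented, and the variable names are only loosely modeled on the members of SimpleLayer.

```python
import numpy as np

np.random.seed(0)
alpha = 0.2

# Stand-ins for a 4-neuron source (hidden) layer feeding a 1-neuron output layer.
src_neuron_row = np.array([[0.0, 0.7, 0.0, 0.3]])      # (1, 4), after relu
src_neuron_col = src_neuron_row.T                       # (4, 1)
src_weight_row_mat = 2 * np.random.random((4, 1)) - 1   # (4, 1)
src_weight_col_mat = src_weight_row_mat.T               # (1, 4)
self_delta = np.array([[0.25]])                         # (1, 1), from calc_delta()

# Spread this layer's delta back across the weights that produced it.
delta_scalar = np.dot(self_delta, src_weight_col_mat)   # (1, 4)
# Gate it: source neurons that relu clamped to zero get no blame.
delta_threshold = 1.0 * (src_neuron_row > 0)            # (1, 4)
src_delta = delta_scalar * delta_threshold              # (1, 4)

# Weight adjustment: product of the source neuron column and this layer's delta.
mat = np.dot(src_neuron_col, self_delta)                # (4, 1)
before = src_weight_row_mat.copy()
src_weight_row_mat += alpha * mat

print(src_delta)
print(src_weight_row_mat - before)
```

The weights attached to the two silent source neurons do not move at all, which is the “only adjust the weights that participated” idea in action.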

And that’s pretty much the guts of this implementation. The important things to remember are:

  • Input and output layers are special cases. The neurons are explicitly set in the input layer and there is no activation or derivative function applied to the output neurons (no, I don’t know why yet. When I figure that out, I’ll explain why here)
  • In training, the current layer’s neuron’s values are set by multiplying the source neurons by the source weights.
  • In learning, the source layer’s weights are adjusted by the current layer’s deltas, but thresholded by the derivative of the source layer’s neurons

This is pretty complicated, and I’ve split out the steps so that it’s possible to step through the running code in the debugger and see what’s going on with the values. But that only gives a level of insight at a single step. How can we show the global behavior of a layer?

Graphing

We are going to graph both the changing value of the neurons and the evolving weights. The neurons are an easier problem so we’ll start there:

def plot_neuron_matrix(self, fig_num: int):
    title = "{} neuron history".format(self.name)
    plt.figure(fig_num)
    np_mat = np.array(self.neuron_history_mat)

    plt.plot(np_mat.T, '-o', linestyle=' ', ms=2)

    names = []
    for i in range(self.num_neurons):
        names.append("neuron {}".format(i))
    plt.legend(names)
    plt.title(title)

This method simply takes the history matrix (where we have a row for each neuron and a column for each time sample), turns it into a numpy array for easier manipulation and plotting, and plots the transpose (where each neuron’s history is a column, which is what PyPlot draws as a series). Because the neuron’s values change for each sample, the history of how they converge towards the final values doesn’t show up well with lines, so I set the drawing arguments to points ‘-o’, no line ‘ ‘ , with a marker size of 2 points (ms=2):

Figure 6: Neuron Histories by layer

In the input layer, we see that the neurons are either one or zero, just as we set them. These values are then multiplied by the (initially random) weights and further adjusted by the activation function. Those values ripple through the hidden layer, where they are initially scattered over the (0, 1) interval, and are then used to set the output neurons (which does not involve an activation function). Over time, you can see the system settle into a state where all neurons are either one or zero, depending on the inputs. So how do the weights achieve this?

Even in this toy system, there are still a lot of weights to keep track of, and I’m still working on a way of visualizing the process. I’m visualizing the weights instead of the neurons, because the weights are the “factors in the equation” that manipulate the “x” values to get a “y”. In other words, I’m watching how the “m” and “b” converge on their values in “y = mx + b”, rather than looking at a particular “x” or “y”.

The method that does this is plot_weight_matrix(), which assembles a chart for each set of weights and is called at the end of simple_nn.py:

def plot_weight_matrix(self, fig_num: int):
    var_name = "weight"
    title = "{} to {} {}".format(self.name, self.target.name, var_name)
    plt.figure(fig_num)
    np_mat = np.array(self.weight_history_mat)

    i = 0
    for row in np_mat.T:
        cstr = "C{}".format(i % self.num_neurons)
        plt.plot(row, linewidth = int(i / self.num_neurons)+1, color=cstr)
        i += 1

    names = []
    num_weights = self.num_neurons * self.target.num_neurons
    for i in range(num_weights):
        src_n = i % self.num_neurons
        targ_n = int(i/self.num_neurons)
        names.append("{} s:t[{}:{}]".format(var_name, src_n, targ_n))
    plt.legend(names)
    plt.title(title)

One of the reasons that I really like OO programming is that so much useful data is associated with the object. You don’t have to go looking for it, or scope things in peculiar ways. As a result, for example, generating the title is simply assembling some strings that I already have lying around.

title = "{} to {} {}".format(self.name, self.target.name, var_name)

The next important step is to get the data that we’ve been assembling in train() into a form that the plotting library likes. The data has been assembled in a list of lists, where each individual list is a snapshot of the weights at one step in the training process. I do it this way for two reasons:

  1. I don’t know how many steps this process is going to take, and python lists handle dynamic memory allocation nicely.
  2. The weight matrix is a 2D NumPy array, and dealing with a series of matrices is something that PyPlot has no idea how to handle.

Here’s the line from train():

if(self.target != None):
    self.weight_history_mat.append(self.nparray_to_list(self.weight_row_mat))

PyPlot doesn’t really like to handle lists of lists, but it does know how to handle one big NumPy array, so we convert the list of lists to a matrix where the rows are the timesteps, and the columns are the weights:

np_mat = np.array(self.weight_history_mat)
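
Here is a small standalone sketch of that conversion, using a made-up 2x2 weight matrix and three fake training steps. The call flatten().tolist() is an assumed stand-in for nparray_to_list(), whose body may differ.

```python
import numpy as np

# Hypothetical history: three snapshots of a 2x2 weight matrix.
history = []
weights = np.array([[0.0, 1.0],
                    [2.0, 3.0]])
for step in range(3):
    # Flatten each snapshot to a plain list (assumed equivalent of
    # nparray_to_list(); the real helper may work differently).
    history.append(weights.flatten().tolist())
    weights = weights + 0.1   # pretend a learning step happened

np_mat = np.array(history)
print(np_mat.shape)        # one row per timestep, one column per weight
print(np_mat.T[0])         # the history of the first weight
```

Taking the transpose puts each weight’s history in its own row, which makes it easy to loop over the weights one at a time.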

At this point we could simply plot everything:

plt.plot(np_mat)

That produces a pretty chart:

Figure 7: First pass at drawing a lot of weights

But it’s pretty confusing. There are a lot of lines. So I did two related things. I set the line thickness to be a function of the target neuron and the color of the line to be a function of the source neuron. Using the same scheme, I built a legend to indicate the source and target neurons that identify each weight, using the coordinates of the matrix – basically treating it as an adjacency matrix.

i = 0
for row in np_mat.T:
    cstr = "C{}".format(i % self.num_neurons)
    plt.plot(row, linewidth = int(i / self.num_neurons)+1, color=cstr)
    i += 1

names = []
num_weights = self.num_neurons * self.target.num_neurons
for i in range(num_weights):
    src_n = i % self.num_neurons
    targ_n = int(i/self.num_neurons)
    names.append("{} s:t[{}:{}]".format(var_name, src_n, targ_n))
plt.legend(names)
plt.title(title)

And that gives a chart that lets us examine what’s going on. All the blue lines are the weights that adjust the value coming from the first source neuron (index 0), distributed over target neurons [0, 1, 2, 3]. All the thin lines are the weights that set the value of the first target neuron (index 0) from source neurons [0, 1, 2]:

Figure 8: Weights between input and hidden layers

We now have a way to visualize the whole process inside the layers. Let’s see if we can learn anything by looking at how the neurons and weights coevolve over time.

Some final thoughts

I think the fundamental lesson here is one of gradient descent (or hill climbing if you prefer) from a random initial state to a stable set of values that will set the variables in a function. Once those values are found, the function can do its job, which in this case is taking a set of observations – ([ 1, 0, 1 ], [ 0, 1, 1 ], [ 0, 0, 1 ], [ 1, 1, 1 ]) and transforming them to a different set of values – ([1], [1], [0], [0]).

Figure 9: Weights influencing neurons

This is at its core stochastic, a mechanism for harnessing randomness by using rules. The weights and neurons exist in a constrained, multidimensional space. Much of this is fixed before a single iteration – the number of neurons and how they are arranged. The types of connections (activation and derivative functions). The initial value of the weights. Even the manner of input and the “fitness test” that determines the error that is measured. Within these constraints, the weights move slowly under multiple influences until they settle into places that they are no longer forced to move. That’s it.

Variations in this system can be used for all kinds of things, ranging from image recognition to generating words, but the basic process is always the same.
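
For readers who want the whole process on one screen, here is a compact sketch of the same streetlight problem with the layer objects stripped away. It follows the style of the Grokking Deep Learning notebook that this code is based on rather than the SimpleLayer class. The seed of 1 and the fixed count of 60 iterations are assumptions made for this sketch, so the numbers will not match the run shown earlier.

```python
import numpy as np

def relu(x):
    return (x > 0) * x

def relu2deriv(output):
    return 1.0 * (output > 0)

np.random.seed(1)   # assumed seed; other seeds may need more iterations
alpha = 0.2

streetlights = np.array([[1, 0, 1],
                         [0, 1, 1],
                         [0, 0, 1],
                         [1, 1, 1]])
walk_vs_stop = np.array([[1], [1], [0], [0]])

weights_0_1 = 2 * np.random.random((3, 4)) - 1   # input -> hidden
weights_1_2 = 2 * np.random.random((4, 1)) - 1   # hidden -> output

errors = []
for iteration in range(60):
    error = 0.0
    for i in range(len(streetlights)):
        # forward pass (train() in SimpleLayer terms)
        layer_0 = streetlights[i:i + 1]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        layer_2 = np.dot(layer_1, weights_1_2)

        # error and deltas (calc_delta() and the first half of learn())
        layer_2_delta = walk_vs_stop[i:i + 1] - layer_2
        error += np.sum(layer_2_delta ** 2)
        layer_1_delta = np.dot(layer_2_delta, weights_1_2.T) * relu2deriv(layer_1)

        # weight updates, scaled by alpha (the second half of learn())
        weights_1_2 += alpha * np.dot(layer_1.T, layer_2_delta)
        weights_0_1 += alpha * np.dot(layer_0.T, layer_1_delta)
    errors.append(error / len(streetlights))

print("first error = {:.5f}, last error = {:.5f}".format(errors[0], errors[-1]))
```

Everything the layer class does is here in about a dozen lines of arithmetic; what the class adds is the bookkeeping that makes the plots possible.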

I hope this helped you to read as much as it helped me to write!

Full code listings

For the most current versions, please use the GitHub repo, but these are up to date as of January 10, 2019

simple_nn.py

'''
Copyright 2019 Philip Feldman

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and 
associated documentation files (the "Software"), to deal in the Software without restriction, including 
without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 
copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the 
following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial 
portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT 
LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO 
EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR 
THE USE OR OTHER DEALINGS IN THE SOFTWARE.
'''

# based on https://github.com/iamtrask/Grokking-Deep-Learning/blob/master/Chapter6%20-%20Intro%20to%20Backpropagation%20-%20Building%20Your%20First%20DEEP%20Neural%20Network.ipynb
import numpy as np
import matplotlib.pyplot as plt
import src.SimpleLayer as sl

# Methods ---------------------------------------------
# activation function: sets all negative numbers to zero
# Otherwise returns x
def relu(x: np.array) -> np.array :
    return (x > 0) * x

# This is the derivative of the above relu function, since the derivative of 1x is 1
def relu2deriv(output: np.array) -> np.array:
    return 1.0*(output > 0) # returns 1 for input > 0
    # return 0 otherwise

# create a layer
def create_layer(layer_name: str, neuron_count: int, target: sl.SimpleLayer = None) -> 'SimpleLayer':
    layer = sl.SimpleLayer(layer_name, neuron_count, relu, relu2deriv, target)
    layer_array.append(layer)
    return layer

# variables ------------------------------------------
np.random.seed(0)
alpha = 0.2
# the samples. Columns are the things we're sampling, rows are the samples
streetlights_array = np.array( [[ 1, 0, 1 ],
                                [ 0, 1, 1 ],
                                [ 0, 0, 1 ],
                                [ 1, 1, 1 ]])
num_streetlights = len(streetlights_array[0])
num_samples = len(streetlights_array)

# The data set we want to map to. Each entry in the array matches the corresponding streetlights_array row
walk_vs_stop_array = np.array([[1],
                               [1],
                               [0],
                               [0]])

# set up the list that will store the layers (each layer holds its own numpy weight matrices)
layer_array = []

error_plot_mat = [] # for drawing plots

#set up the layers from last to first, so that there is a target layer
output = create_layer("output", 1)
output.set_neurons([1])
''' # If we want to have four layers (two hidden), use this and comment out the other hidden code below
hidden2 = create_layer("hidden2", 2, output)
hidden2.set_neurons([1, 2])
hidden = create_layer("hidden", 4, hidden2)
hidden.set_neurons([1, 2, 3, 4])
'''
# If we want to have three layers (one hidden), use this and comment out the other hidden code above
hidden = create_layer("hidden", 4, output)
hidden.set_neurons([1, 2, 3, 4])

input = create_layer("input", 3, hidden)
input.set_neurons([1, 2, 3])

for layer in reversed(layer_array):
    print ("--------------")
    print (layer.to_string())

iter = 0
max_iter = 1000
epsilon = 0.001
error = 2 * epsilon
while error > epsilon:
    error = 0
    sample_error_array = []
    for sample_index in range(num_samples):
        input.set_neurons(streetlights_array[sample_index])
        for layer in reversed(layer_array):
            layer.train()

        delta = output.calc_delta(walk_vs_stop_array[sample_index])
        sample_error = np.sum(delta ** 2)
        error += sample_error

        for layer in layer_array:
            layer.learn(alpha)

        # Gather data for the plots
        sample_error_array.append(sample_error)
        # print("{}.{} Error = {:.5f}".format(iter, sample_index, sample_error))

    error /= num_samples
    sample_error_array.append(error)
    error_plot_mat.append(sample_error_array)
    if (iter % 10) == 0:
        print("{} Error = {:.5f}".format(iter, error))
    iter += 1
    # stop even if we don't converge
    if iter > max_iter:
        break

print("\n--------------evaluation")
for sample_index in range(len(streetlights_array)):
    input.set_neurons(streetlights_array[sample_index])
    for layer in reversed(layer_array):
        layer.train()
    # each array holds a single element; index it out before converting to float
    prediction = float(output.neuron_row_array[0][0])
    observed = float(walk_vs_stop_array[sample_index][0])
    accuracy = 1.0 - abs(prediction - observed)
    print("sample {} - input: {} = pred: {:.3f} vs. actual:{} ({:.2f}% accuracy)".
          format(sample_index, input.neuron_row_array, prediction, observed, accuracy*100.0))

# plots ----------------------------------------------
fig_num = 1
f1 = plt.figure(fig_num)
plt.plot(error_plot_mat)
plt.title("error")
names = []
for i in range(num_samples):
    names.append("sample_{}".format(i))
names.append("average")
plt.legend(names)

for layer in reversed(layer_array):
    if layer.target != None:
        fig_num += 1
        layer.plot_weight_matrix(fig_num)

for layer in reversed(layer_array):
    fig_num += 1
    layer.plot_neuron_matrix(fig_num)

plt.show()
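
For comparison, here is a minimal standalone sketch of the same training loop without the layer classes. It follows the Grokking Deep Learning chapter 6 notebook that the layer code is based on. The function names (relu, relu_deriv, train_streetlights) and the defaults (an alpha of 0.2, a random seed of 1, 60 passes over the data) are assumptions for illustration, not values taken from the script above.

```python
import numpy as np

def relu(x):
    # clamp negative values to zero
    return (x > 0) * x

def relu_deriv(x):
    # slope of relu: 1 where the neuron was active, 0 elsewhere
    return (x > 0).astype(float)

def train_streetlights(iterations=60, alpha=0.2, hidden_size=4, seed=1):
    # hypothetical helper: all defaults are assumptions, see the text above
    np.random.seed(seed)
    streetlights = np.array([[1, 0, 1],
                             [0, 1, 1],
                             [0, 0, 1],
                             [1, 1, 1]])
    walk_vs_stop = np.array([[1], [1], [0], [0]])

    # weights start in the range [-1, 1), as in SimpleLayer.__init__
    w_in_hid = 2 * np.random.random((3, hidden_size)) - 1
    w_hid_out = 2 * np.random.random((hidden_size, 1)) - 1

    errors = []
    for _ in range(iterations):
        total = 0.0
        for i in range(len(streetlights)):
            # forward pass (what train() does layer by layer)
            layer_in = streetlights[i:i + 1]            # shape (1, 3)
            layer_hid = relu(layer_in.dot(w_in_hid))    # shape (1, hidden_size)
            layer_out = layer_hid.dot(w_hid_out)        # shape (1, 1), no activation

            # error delta at the output (what calc_delta() does)
            delta_out = walk_vs_stop[i:i + 1] - layer_out
            total += float(np.sum(delta_out ** 2))

            # backward pass (what learn() does layer by layer)
            delta_hid = delta_out.dot(w_hid_out.T) * relu_deriv(layer_hid)
            w_hid_out += alpha * layer_hid.T.dot(delta_out)
            w_in_hid += alpha * layer_in.T.dot(delta_hid)
        # mean squared error per sample for this pass
        errors.append(total / len(streetlights))
    return errors, w_in_hid, w_hid_out

errors, _, _ = train_streetlights()
print("first pass error = {:.5f}, last pass error = {:.5f}".format(errors[0], errors[-1]))
```

The per-pass mean error plays the same role as the error variable in the while loop above. The class-based version stops when that value drops below epsilon instead of running a fixed number of passes.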

SimpleLayer.py

'''
Copyright 2019 Philip Feldman

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and 
associated documentation files (the "Software"), to deal in the Software without restriction, including 
without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 
copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the 
following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial 
portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT 
LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO 
EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR 
THE USE OR OTHER DEALINGS IN THE SOFTWARE.
'''
# based on https://github.com/iamtrask/Grokking-Deep-Learning/blob/master/Chapter6%20-%20Intro%20to%20Backpropagation%20-%20Building%20Your%20First%20DEEP%20Neural%20Network.ipynb
import numpy as np
import matplotlib.pyplot as plt
import types
import typing

# methods --------------------------------------------
class SimpleLayer:
    name = "unset"
    neuron_row_array = None
    neuron_col_array = None
    weight_row_mat = None
    weight_col_mat = None
    weight_history_mat = [] # for drawing plots
    neuron_history_mat = []
    num_neurons = 0
    delta = 0 # the amount to move the source layer
    target = None
    source = None
    activation_func = None
    derivative_func = None

    # set up the layer with the number of neurons, the next layer in the sequence, and the activation/backprop functions
    def __init__(self, name, num_neurons: int, activation_ptr: types.FunctionType, deriv_ptr: types.FunctionType, target: 'SimpleLayer' = None):
        self.reset()
        self.activation_func = activation_ptr
        self.derivative_func = deriv_ptr
        self.name = name
        self.num_neurons = num_neurons
        self.neuron_row_array = np.zeros((1, num_neurons))
        self.neuron_col_array = np.zeros((num_neurons, 1))
        for i in range(num_neurons):
            self.neuron_history_mat.append([])
        # We only have weights if there is another layer below us
        if target is not None:
            self.target = target
            target.source = self
            self.weight_row_mat = 2 * np.random.random((num_neurons, target.num_neurons)) - 1
            self.weight_col_mat = self.weight_row_mat.T

    def reset(self):
        self.name = "unset"
        self.target = None
        self.neuron_row_array = None
        self.neuron_col_array = None
        self.weight_row_mat = None
        self.weight_col_mat = None
        self.weight_history_mat = [] # for drawing plots
        self.neuron_history_mat = []
        self.num_neurons = 0
        self.delta = 0 # the amount to move the source layer
        self.source = None

    # Fill neurons with values
    def set_neurons(self, val_list: typing.List):
        # print("cur = {}, input = {}".format(self.neuron_row_array, val_list))
        for i in range(0, len(val_list)):
            self.neuron_row_array[0][i] = val_list[i]
        self.neuron_col_array = self.neuron_row_array.T

    def get_plot_mat(self) -> typing.List:
        return self.weight_history_mat

    # In training (the forward pass), the basic goal is to set a value for this layer's neurons, based on the source layer's neurons and weights,
    # mediated by an activation function. The weight matrix has one row per source neuron and one column per neuron in this layer. For example,
    # if the source layer had three neurons and this layer had four, then the (source) weight matrix would hold 3*4 = 12 weights.
    def train(self):
        # if not the bottom layer, we can record values for plotting
        if(self.target != None):
            self.weight_history_mat.append(self.nparray_to_list(self.weight_row_mat))

        # if we're not the top layer, compute our neuron values from the source layer
        if self.source != None:
            src = self.source
            # set our neuron values as the dot product of the source neurons, and the source weights
            self.neuron_row_array = np.dot(src.neuron_row_array, src.weight_row_mat)

            # No activation function to output layer
            if(self.target != None):
                # Adjust the values based on the activation function. This introduces nonlinearity.
                # For example, the relu function clamps all negative values to zero
                self.neuron_row_array = self.activation_func(self.neuron_row_array)

            # Transpose the neuron array and save for learn()
            self.neuron_col_array = self.neuron_row_array.T

        # record values for plotting
        for i in range(self.num_neurons):
            self.neuron_history_mat[i].append(self.neuron_row_array[0][i])


    # In learning, the basic goal is to adjust the weights that set this layer's neurons (in this implementation, they are stored in the source layer).
    # This is done by backpropagating the error delta from this layer to the source layer. Since we only want to adjust the weights that contributed
    # to the output in train(), the delta is scaled by the derivative of the activation function used there. Again, the weight matrix has one row
    # per source neuron and one column per neuron in this layer. For example, if the source layer had three neurons and this layer had four, then
    # the (source) weight matrix would hold 3*4 = 12 weights.
    def learn(self, alpha):
        # if there is a layer above us
        if self.source != None:
            src = self.source

            # calculate the error delta scalar array, which is the amount this layer needs to change,
            # multiplied across the weights used to set this layer (in the source)
            delta_scalar = np.dot(self.delta, src.weight_col_mat)

            # determine the backpropagation distribution. In the case of Relu, it's just one or zero
            delta_threshold = self.derivative_func(src.neuron_row_array)

            # set the amount the source layer needs to change, based on this layer's delta distributed over the source
            # neurons
            src.delta = delta_scalar * delta_threshold

            # create the weight adjustment matrix by taking the dot product of the source layer's neurons (as columns) and the
            # scaled, thresholded row of deltas based on this layer's error delta and the source's weight layer
            mat = np.dot(src.neuron_col_array, self.delta)

            # add some percentage of the weight adjustment matrix to the source weight matrix
            src.weight_row_mat += alpha * mat
            src.weight_col_mat = src.weight_row_mat.T

    # given one or more goals (that match the number of neurons in this layer), determine the delta that, when added to the
    # neurons, would reach that goal
    def calc_delta(self, goal: np.ndarray) -> np.ndarray:
        self.delta = goal - self.neuron_row_array
        return self.delta

    # helper function to turn a NumPy array to a Python list
    def nparray_to_list(self, vals: np.ndarray) -> typing.List[float]:
        data = []
        for x in np.nditer(vals):
            data.append(float(x))
        return data

    def to_string(self):
        target_name = "no target"
        source_name = "no source"
        if self.target != None:
            target_name = self.target.name
        if self.source != None:
            source_name = self.source.name
        return "layer {}: \ntarget = {}\nsource = {}\nneurons (row) = {}\nweights (row) = \n{}".format(self.name, target_name, source_name, self.neuron_row_array, self.weight_row_mat)

    # create a line chart of the plot matrix that we've been building
    def plot_weight_matrix(self, fig_num: int):
        var_name = "weight"
        title = "{} to {} {}".format(self.name, self.target.name, var_name)
        plt.figure(fig_num)
        np_mat = np.array(self.weight_history_mat)

        i = 0
        for row in np_mat.T:
            cstr = "C{}".format(i % self.num_neurons)
            plt.plot(row, linewidth = int(i / self.num_neurons)+1, color=cstr)
            i += 1

        names = []
        num_weights = self.num_neurons * self.target.num_neurons
        for i in range(num_weights):
            src_n = i % self.num_neurons
            targ_n = int(i/self.num_neurons)
            names.append("{} s:t[{}:{}]".format(var_name, src_n, targ_n))
        plt.legend(names)
        plt.title(title)

    def plot_neuron_matrix(self, fig_num: int):
        title = "{} neuron history".format(self.name)
        plt.figure(fig_num)
        np_mat = np.array(self.neuron_history_mat)

        # markers only; the 'o' format string draws no connecting line
        plt.plot(np_mat.T, 'o', ms=2)

        names = []
        for i in range(self.num_neurons):
            names.append("neuron {}".format(i))
        plt.legend(names)
        plt.title(title)
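
To make the matrix shapes in learn() easier to follow, here is a small worked sketch of a single backpropagation step using plain arrays. The function name backprop_step and the sample numbers are hypothetical, and the derivative is hard-coded to the relu case (one where the source neuron is positive, zero elsewhere), which is an assumption, since the class takes its derivative function as a parameter.

```python
import numpy as np

def backprop_step(src_neurons, weights, delta, alpha):
    # src_neurons: (1, n_src) row of source neuron values
    # weights:     (n_src, n_this) matrix stored in the source layer
    # delta:       (1, n_this) error delta of this layer
    # the delta pushed back to the source, thresholded by the relu derivative
    src_delta = np.dot(delta, weights.T) * (src_neurons > 0)
    # outer product: one adjustment per (source neuron, this neuron) pair
    adjustment = np.dot(src_neurons.T, delta)
    # return a new matrix rather than updating in place
    new_weights = weights + alpha * adjustment
    return src_delta, new_weights

# three source neurons feeding four neurons in this layer
src_neurons = np.array([[1.0, 0.0, 2.0]])
weights = np.full((3, 4), 0.5)
delta = np.array([[0.1, -0.2, 0.3, 0.0]])

src_delta, new_weights = backprop_step(src_neurons, weights, delta, alpha=0.1)
print(src_delta.shape)    # (1, 3)
print(new_weights.shape)  # (3, 4)
```

The middle source neuron is zero, so it receives no delta and its row of weights is left unchanged. The other two rows move in proportion to how active their neurons were.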