
Welcome to the sixth episode of the series. This one took a long time to prepare, as it delves deep into the concept of backpropagation, perhaps the most important algorithm in machine learning.
It’s hard to fully grasp, as it requires understanding at multiple levels. It’s not complicated in itself; it’s just complex. Indeed, a decent background in calculus is enough from a knowledge point of view, but you have to think simultaneously at multiple levels to switch from math to code, and that took me several days of reasoning, 3Blue1Brown rewatches and a lot of ELI5 questions to ChatGPT.
In this post, I will start by explaining the calculus the way I had to study it to fully understand it. You’ll see that it’s strongly based on the 3Blue1Brown video, but I had to approach it slightly differently to really understand what was going on. Then I will move on to the implementation and, lastly, test the results. Let’s start!
Backpropagation – Calculus
In machine learning, backpropagation is a gradient estimation method used to train neural network models. The gradient estimate is used by the gradient descent algorithm to compute the network’s parameter updates. It’s a very efficient way to compute exact gradients without the need for finite differences. So why did I implement the finite difference method instead of jumping straight to backpropagation? Well, as you’ll see, backpropagation is not very intuitive and would have added too much complexity, making both me and you lose focus. But now that everything is set up, we can get on with it.
Let’s start simple, by looking at the following architecture of a very basic network:

As always, we want to minimize the MSE, from now on called, for simplicity, the cost function C. For the network above, that is something of this form:

C(w_1, b_1, w_2, b_2, w_3, b_3, w_4, b_4)

What we want to know is, in simple terms: “How much does a change in any of the parameters affect the cost function?“
This is not something new: it means we want to compute a gradient that, when applied repeatedly, slowly drives the cost function toward zero. To do so, instead of computing the derivatives with a method like finite differences (which, by the way, is just an approximation), we try to understand how to correct the prediction error by changing the values of the parameters going backwards, starting from the measurement of the error, until we reach the first layer.
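Just to make the connection explicit: once we have the gradient, each parameter is nudged against it with the usual gradient descent update, where \eta is the learning rate (the rate parameter in our code):

w^{(l)} \leftarrow w^{(l)} - \eta \frac{\partial C}{\partial w^{(l)}}, \qquad b^{(l)} \leftarrow b^{(l)} - \eta \frac{\partial C}{\partial b^{(l)}}

Backpropagation only changes how the gradient is computed; the update rule stays the same.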
I know this sounds complex, but we will try to understand it step by step. Let’s focus, for example, just on the last two neurons, and let’s update the notation: the superscript denotes the layer a parameter belongs to. For example, the weight of layer L is w^{(L)}.

And let’s now write the cost function for the first input sample. Remember that to compute the MSE, we have to average all the cost functions. But for the moment let’s focus on the first one:
C_0 = (a^{(L)} - y)^2

where a^{(L)} is the output of the activation function, a^{(L)} = \sigma(z^{(L)}), and z^{(L)} is:
z^{(L)} = w^{(L)}a^{(L-1)} + b^{(L)}

It’s okay to stop for a moment and think about these formulas. There’s nothing new about them, but it’s important to have them very clearly in mind.
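To make these formulas concrete with some made-up numbers of mine: if w^{(L)} = 2, a^{(L-1)} = 0.5 and b^{(L)} = 0, then:

z^{(L)} = 2 \cdot 0.5 + 0 = 1, \qquad a^{(L)} = \sigma(1) \approx 0.731

and, if the expected output is y = 1, the cost for this sample is C_0 = (0.731 - 1)^2 \approx 0.072.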
Ok, let’s sum up:
\begin{aligned} C_0 &= (a^{(L)} - y)^2 \\ a^{(L)} &= \sigma(z^{(L)}) \\ z^{(L)} &= w^{(L)}a^{(L-1)} + b^{(L)} \end{aligned}

Nice, let’s move on. Remember that we want to compute the gradient of the cost function that we can apply to the neural network’s parameters while doing the gradient descent. The gradient is something like this:
\nabla C=\left[\begin{array}{c} \frac{\partial C}{\partial w^{(1)}} \\ \frac{\partial C}{\partial b^{(1)}} \\ \vdots \\ \frac{\partial C}{\partial w^{(L)}} \\ \frac{\partial C}{\partial b^{(L)}} \end{array}\right]

So now we just have to compute all of the partial derivatives inside the square brackets. Let’s start, for example, with \frac{\partial C}{\partial w^{(L)}}. Since C is just the average of the per-sample costs, we can work on a single sample, C_0, first:
\begin{aligned} \frac{\partial C_0}{\partial w^{(L)}} &= \frac{\partial}{\partial w^{(L)}} \left((a^{(L)} - y)^2\right) \\ &= \frac{\partial}{\partial w^{(L)}} \left((\sigma(z^{(L)}) - y)^2\right) \\ &= \frac{\partial}{\partial w^{(L)}} \left((\sigma(w^{(L)}a^{(L-1)} + b^{(L)}) - y)^2\right) \end{aligned}

This is actually pretty hard to compute directly. But, fortunately, we have a very useful property of derivatives, called the chain rule. The chain rule states that, having
y = f(u) \text{ where } u = g(x)

the derivative \frac{dy}{dx} can be computed as:
\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}

And this applies to arbitrarily long nested functions:
y = f(u) \text{ where } u = g(v) \text{ where } v = k(x)

\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dv} \cdot \frac{dv}{dx}
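To see it in action on a quick example of my own, take y = u^2 with u = 3x + 1:

\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = 2u \cdot 3 = 6(3x + 1)

which is exactly what you get by expanding y = (3x + 1)^2 = 9x^2 + 6x + 1 and differentiating directly.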
So, if C_0, written in simple terms, is a nested function like this:

C_0\left(a^{(L)}\left(z^{(L)}\left(w^{(L)}\right)\right)\right)

it means we can write its derivative as:
\frac{\partial C_0}{\partial w^{(L)}}= \frac{\partial C_0}{\partial a^{(L)}} \cdot \frac{\partial a^{(L)}}{\partial z^{(L)}} \cdot \frac{\partial z^{(L)}}{\partial w^{(L)}}

Now it is quite easy to get the derivative of each factor:
\begin{alignedat}{2} \frac{\partial C_0}{\partial a^{(L)}} &= \frac{\partial}{\partial a^{(L)}} \left((a^{(L)} - y)^2\right) &&= 2(a^{(L)} - y) \\ \frac{\partial a^{(L)}}{\partial z^{(L)}} &= \frac{\partial}{\partial z^{(L)}} \left(\sigma(z^{(L)})\right) &&= \sigma(z^{(L)}) (1 - \sigma(z^{(L)})) \\ \frac{\partial z^{(L)}}{\partial w^{(L)}} &= \frac{\partial}{\partial w^{(L)}} \left(w^{(L)}a^{(L-1)} + b^{(L)}\right) &&= a^{(L-1)} \end{alignedat}

Putting all the pieces together, we get:
\frac{\partial C_0}{\partial w^{(L)}} = 2(a^{(L)} - y) \cdot \sigma(z^{(L)}) (1 - \sigma(z^{(L)})) \cdot a^{(L-1)}

And that’s it! If we now average over all the samples, we get:
\frac{\partial C}{\partial w^{(L)}} = \frac{1}{n}\sum_{i=0}^{n-1}\frac{\partial C_i}{\partial w^{(L)}}

With this, we have computed a single component of the cost function’s gradient.
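Before moving on, here is a quick way to convince yourself that the formula is right: a tiny standalone program (a sketch of mine, independent of nn.h) that compares the analytic derivative with a finite-difference approximation for a single sigmoid neuron.

#include <math.h>
#include <stdio.h>

// Sanity check: analytic dC0/dw vs. finite difference for one sigmoid neuron.
static float sigmoidf(float x) { return 1.0f / (1.0f + expf(-x)); }

static float cost(float w, float a_prev, float b, float y)
{
    float a = sigmoidf(w * a_prev + b);
    return (a - y) * (a - y);
}

int main(void)
{
    float w = 0.7f, b = 0.1f, a_prev = 0.5f, y = 1.0f;
    float a = sigmoidf(w * a_prev + b);
    // Analytic: 2(a - y) * sigma'(z) * a_prev, with sigma'(z) = a * (1 - a)
    float analytic = 2.0f * (a - y) * a * (1.0f - a) * a_prev;
    float eps = 1e-3f;
    float numeric = (cost(w + eps, a_prev, b, y) - cost(w, a_prev, b, y)) / eps;
    printf("analytic: %f, finite difference: %f\n", analytic, numeric);
    return 0;
}

The two numbers should agree to a few decimal places: backpropagation computes the same quantity as the finite difference method, but exactly and far more cheaply.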
Let’s focus on another term, for example \frac{\partial C_0}{\partial b^{(L)}}. We can apply the chain rule in the same way; it won’t change too much, just the last factor:
\frac{\partial C_0}{\partial b^{(L)}}= \frac{\partial C_0}{\partial a^{(L)}} \cdot \frac{\partial a^{(L)}}{\partial z^{(L)}} \cdot \frac{\partial z^{(L)}}{\partial b^{(L)}}

The first two factors are the same; the only thing that changes is the last one, which is:
\frac{\partial z^{(L)}}{\partial b^{(L)}} = \frac{\partial}{\partial b^{(L)}} \left(w^{(L)}a^{(L-1)} + b^{(L)}\right) = 1

The final form, then, is simply:
\frac{\partial C_0}{\partial b^{(L)}}= 2(a^{(L)} - y) \cdot \sigma(z^{(L)}) (1 - \sigma(z^{(L)}))

Ok, nice: we successfully computed the values for the weight and bias of the last layer. Let’s move on with one last example (here you’ll see the strength of backpropagation in action):
\frac{\partial C_0}{\partial w^{(L-1)}} = \frac{\partial C_0}{\partial a^{(L-1)}} \cdot \frac{\partial a^{(L-1)}}{\partial z^{(L-1)}} \cdot \frac{\partial z^{(L-1)}}{\partial w^{(L-1)}}

Notice how computing the last two factors is trivial, as it was before, but what about the first one? How are we supposed to know its form? Well, this is what backpropagation is all about. After computing the gradients of the weights and biases of a layer, we also compute the gradient of the cost with respect to the previous layer’s activations. It is a way of propagating backwards how much we would like the previous layer’s activation to change. Therefore, before moving to the previous layer, we compute:
\frac{\partial C_0}{\partial a^{(L-1)}} = \frac{\partial C_0}{\partial a^{(L)}} \cdot \frac{\partial a^{(L)}}{\partial z^{(L)}} \cdot \frac{\partial z^{(L)}}{\partial a^{(L-1)}}

where, as with the bias, the only thing that changes is the last factor:
\frac{\partial z^{(L)}}{\partial a^{(L-1)}} = \frac{\partial}{\partial a^{(L-1)}} \left(w^{(L)}a^{(L-1)} + b^{(L)}\right) = w^{(L)}

Its final form, then, is:
\frac{\partial C_0}{\partial a^{(L-1)}}= 2(a^{(L)} - y) \cdot \sigma(z^{(L)}) (1 - \sigma(z^{(L)})) \cdot w^{(L)}

From here on, we can apply the same reasoning layer by layer until we have the complete gradient.
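Notice that the first two factors, 2(a^{(L)} - y) \cdot \sigma(z^{(L)})(1 - \sigma(z^{(L)})), are the same in all three derivatives. A way I found useful to see the pattern (the \delta notation is mine, not from the video) is to give that product a name:

\begin{aligned} \delta^{(L)} &= 2(a^{(L)} - y) \cdot \sigma(z^{(L)})(1 - \sigma(z^{(L)})) \\ \frac{\partial C_0}{\partial w^{(L)}} &= \delta^{(L)} \cdot a^{(L-1)} \qquad \frac{\partial C_0}{\partial b^{(L)}} = \delta^{(L)} \qquad \frac{\partial C_0}{\partial a^{(L-1)}} = \delta^{(L)} \cdot w^{(L)} \end{aligned}

As we’ll see, this is exactly the quantity the implementation caches: compute it once per neuron, then reuse it for the weight, the bias and the backpropagated activation.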
It’s ok if you don’t get it at first. Take your time: it can take days to understand, so don’t worry. But when it’s all clear, we can move on to a more complex network, and you’ll see how everything applies in the same way, just with another level of complexity.

We have introduced another term into the notation: the subscript, which refers to a particular neuron in the layer. This value will be j when iterating over the neurons of the current layer, and k when iterating over the neurons of the previous layer. Also, be careful with the indices of the weights: swapping them can lead to incorrect behaviour.
Let’s update the formulas we wrote before to accommodate the new model we defined:
\begin{aligned} C_0 &= \sum_{j = 0}^{n_{L}-1} (a_j^{(L)} - y_j)^2 \\ a_j^{(L)} &= \sigma(z_j^{(L)}) \\ z_j^{(L)} &= \sum_{k = 0}^{n_{L-1}-1} w_{jk}^{(L)} \cdot a_k^{(L-1)} + b_j^{(L)} \end{aligned}

I know it looks complicated. We just added another layer of complexity, but the reasoning is the same as before.
Indeed, just by adding the proper indices, we get the same formula for the partial derivative of the weights:
\begin{aligned} \frac{\partial C_0}{\partial w_{jk}^{(L)}} &= \frac{\partial C_0}{\partial a_j^{(L)}} \cdot \frac{\partial a_j^{(L)}}{\partial z_j^{(L)}} \cdot \frac{\partial z_j^{(L)}}{\partial w_{jk}^{(L)}} \\ &= 2(a_j^{(L)} - y_j) \cdot \sigma(z_j^{(L)}) (1 - \sigma(z_j^{(L)})) \cdot a_k^{(L-1)} \end{aligned}

for the biases:
\begin{aligned} \frac{\partial C_0}{\partial b_{j}^{(L)}} &= \frac{\partial C_0}{\partial a_j^{(L)}} \cdot \frac{\partial a_j^{(L)}}{\partial z_j^{(L)}} \cdot \frac{\partial z_j^{(L)}}{\partial b_{j}^{(L)}} \\ &= 2(a_j^{(L)} - y_j) \cdot \sigma(z_j^{(L)}) (1 - \sigma(z_j^{(L)})) \end{aligned}

and for the previous activations (the sum is needed because a single activation affects all the neurons in the following layer):
\begin{aligned} \frac{\partial C_0}{\partial a_{k}^{(L-1)}} &= \sum_{j=0}^{n_L - 1} \frac{\partial C_0}{\partial a_j^{(L)}} \cdot \frac{\partial a_j^{(L)}}{\partial z_j^{(L)}} \cdot \frac{\partial z_j^{(L)}}{\partial a_k^{(L-1)}} \\ &= \sum_{j=0}^{n_L - 1} 2(a_j^{(L)} - y_j) \cdot \sigma(z_j^{(L)}) (1 - \sigma(z_j^{(L)})) \cdot w_{jk}^{(L)} \end{aligned}

This is it! The last three formulas we wrote are the backbone of the backpropagation algorithm. The hard part now is converting all of this into code. As I wrote at the beginning, converting this algorithm into code is complex for two reasons:
- A lot of math
- Several layers of complexity:
  - Iterate over samples
  - Iterate backwards over layers
  - Iterate over neurons
  - Iterate over previous layer neurons
Backpropagation – Implementation
Let’s start, as always, from our nn.h file, and let’s add the backpropagation function declaration:
void nn_backpropagation(NeuralNetwork nn, NeuralNetwork grad, Matrix train_in, Matrix train_out);
And now let’s move on to the definition. The idea is to use the grad data structure directly when calculating the gradient. As I said before, we will compute the gradients by iterating over several levels, which means using several nested for loops. In the innermost loops, we will compute the partial derivatives for the biases, the weights and the activations and add them to the gradient data structure. Then we will average them all, dividing each of them by the number of samples.
Let’s start by defining the for loops and the normalization:
void nn_backpropagation(NeuralNetwork nn, NeuralNetwork grad, Matrix ti, Matrix to)
{
    size_t num_samples = ti.rows;
    // Iterate over the samples
    for (size_t i = 0; i < num_samples; i++)
    {
        // Iterate over the layers
        for (size_t l = nn.num_layers; l > 0; l--)
        {
            // Iterate over the neurons
            for (size_t j = 0; j < nn.activations[l].cols; j++)
            {
                // Iterate over the previous layer neurons
                for (size_t k = 0; k < nn.activations[l - 1].cols; k++)
                {
                    /* code */
                }
            }
        }
    }
    // Average gradients over all samples
    for (size_t i = 0; i < grad.num_layers; i++)
    {
        for (size_t j = 0; j < grad.weights[i].rows; j++)
        {
            for (size_t k = 0; k < grad.weights[i].cols; k++)
            {
                MAT_AT(grad.weights[i], j, k) /= num_samples;
            }
        }
        for (size_t j = 0; j < grad.biases[i].rows; j++)
        {
            for (size_t k = 0; k < grad.biases[i].cols; k++)
            {
                MAT_AT(grad.biases[i], j, k) /= num_samples;
            }
        }
    }
}
Since we are using the grad data structure, and it is always the same in each iteration of the gradient descent, we need to fill it with zeros before using it. To do so, let’s define a function to do precisely that:
// Declaration
void nn_fill(NeuralNetwork nn, float val);

// Definition
void nn_fill(NeuralNetwork nn, float val)
{
    for (size_t i = 0; i < nn.num_layers; i++)
    {
        mat_fill(nn.weights[i], val);
        mat_fill(nn.biases[i], val);
    }
}
Nice, back to the backpropagation function. (Note that nn_fill only touches the weights and biases; the gradient’s activations are reset separately inside the loop, as we are about to see.) From now on, I will omit the normalization part and focus only on the nested loops.
void nn_backpropagation(NeuralNetwork nn, NeuralNetwork grad, Matrix ti, Matrix to)
{
    nn_fill(grad, 0.0f);
    size_t num_samples = ti.rows;
    // Iterate over the samples
    for (size_t i = 0; i < num_samples; i++)
    {
        // Iterate over the layers
        for (size_t l = nn.num_layers; l > 0; l--)
        {
            // Iterate over the neurons
            for (size_t j = 0; j < nn.activations[l].cols; j++)
            {
                // Iterate over the previous layer neurons
                for (size_t k = 0; k < nn.activations[l - 1].cols; k++)
                {
                    /* code */
                }
            }
        }
    }
    /* [...] Average gradients over all samples [...] */
}
When iterating over the samples, we first need to forward each sample through the network. So let’s start by copying its values into the input neurons and passing them on:
void nn_backpropagation(NeuralNetwork nn, NeuralNetwork grad, Matrix ti, Matrix to)
{
    nn_fill(grad, 0.0f);
    size_t num_samples = ti.rows;
    // Iterate over the samples
    for (size_t i = 0; i < num_samples; i++)
    {
        mat_cpy(NN_INPUT(nn), mat_row(ti, i));
        nn_forward(nn);
        // Iterate over the layers
        for (size_t l = nn.num_layers; l > 0; l--)
        {
            // Iterate over the neurons
            for (size_t j = 0; j < nn.activations[l].cols; j++)
            {
                // Iterate over the previous layer neurons
                for (size_t k = 0; k < nn.activations[l - 1].cols; k++)
                {
                    /* code */
                }
            }
        }
    }
    /* [...] Average gradients over all samples [...] */
}
We are adding details at each step, so it should be fairly easy to follow. Each time an error is computed and propagated backwards, the activations of the gradient network get populated, so we need to reset them to zero at each iteration:
void nn_backpropagation(NeuralNetwork nn, NeuralNetwork grad, Matrix ti, Matrix to)
{
    nn_fill(grad, 0.0f);
    size_t num_samples = ti.rows;
    // Iterate over the samples
    for (size_t i = 0; i < num_samples; i++)
    {
        mat_cpy(NN_INPUT(nn), mat_row(ti, i));
        nn_forward(nn);
        for (size_t j = 0; j < grad.num_layers; j++)
        {
            mat_fill(grad.activations[j], 0.0f);
        }
        // Iterate over the layers
        for (size_t l = nn.num_layers; l > 0; l--)
        {
            // Iterate over the neurons
            for (size_t j = 0; j < nn.activations[l].cols; j++)
            {
                // Iterate over the previous layer neurons
                for (size_t k = 0; k < nn.activations[l - 1].cols; k++)
                {
                    /* code */
                }
            }
        }
    }
    /* [...] Average gradients over all samples [...] */
}
We are still in the first loop, but there is something more we can do here. Before the backpropagation step, we have to compute something to actually backpropagate. More specifically, the derivative of the cost function with respect to the output of the neural network, already multiplied by the sigmoid derivative: the first two factors of the chain rule, i.e. the reusable product we identified earlier. This is where we define the change we have to make to reduce the error in the output layer:

\frac{\partial C_i}{\partial a_j^{(L)}} \cdot \frac{\partial a_j^{(L)}}{\partial z_j^{(L)}} = 2(a_j^{(L)} - y_j) \cdot a_j^{(L)} (1 - a_j^{(L)})

Here I wrote a_j^{(L)} directly instead of \sigma(z_j^{(L)}) for simplicity.
Let’s code this:
void nn_backpropagation(NeuralNetwork nn, NeuralNetwork grad, Matrix ti, Matrix to)
{
    nn_fill(grad, 0.0f);
    size_t num_samples = ti.rows;
    // Iterate over the samples
    for (size_t i = 0; i < num_samples; i++)
    {
        mat_cpy(NN_INPUT(nn), mat_row(ti, i));
        nn_forward(nn);
        for (size_t j = 0; j < grad.num_layers; j++)
        {
            mat_fill(grad.activations[j], 0.0f);
        }
        for (size_t j = 0; j < to.cols; j++)
        {
            MAT_AT(NN_OUTPUT(grad), 0, j) = 2 * (MAT_AT(NN_OUTPUT(nn), 0, j) - MAT_AT(to, i, j)) *
                                            MAT_AT(NN_OUTPUT(nn), 0, j) * (1 - MAT_AT(NN_OUTPUT(nn), 0, j));
        }
        // Iterate over the layers
        for (size_t l = nn.num_layers; l > 0; l--)
        {
            // Iterate over the neurons
            for (size_t j = 0; j < nn.activations[l].cols; j++)
            {
                // Iterate over the previous layer neurons
                for (size_t k = 0; k < nn.activations[l - 1].cols; k++)
                {
                    /* code */
                }
            }
        }
    }
    /* [...] Average gradients over all samples [...] */
}
Ok, now we can actually backpropagate everything and enter the inner loops. We jump directly into the loop over j, where we can immediately accumulate the bias gradient for neuron j, as it is just a single value. Let’s recall the formula for the bias:

\frac{\partial C_0}{\partial b_{j}^{(L)}} = 2(a_j^{(L)} - y_j) \cdot \sigma(z_j^{(L)}) (1 - \sigma(z_j^{(L)}))

That is exactly the value we just stored in the gradient’s activations. So let’s add it:
void nn_backpropagation(NeuralNetwork nn, NeuralNetwork grad, Matrix ti, Matrix to)
{
    nn_fill(grad, 0.0f);
    size_t num_samples = ti.rows;
    // Iterate over the samples
    for (size_t i = 0; i < num_samples; i++)
    {
        mat_cpy(NN_INPUT(nn), mat_row(ti, i));
        nn_forward(nn);
        for (size_t j = 0; j < grad.num_layers; j++)
        {
            mat_fill(grad.activations[j], 0.0f);
        }
        for (size_t j = 0; j < to.cols; j++)
        {
            MAT_AT(NN_OUTPUT(grad), 0, j) = 2 * (MAT_AT(NN_OUTPUT(nn), 0, j) - MAT_AT(to, i, j)) *
                                            MAT_AT(NN_OUTPUT(nn), 0, j) * (1 - MAT_AT(NN_OUTPUT(nn), 0, j));
        }
        // Iterate over the layers
        for (size_t l = nn.num_layers; l > 0; l--)
        {
            // Iterate over the neurons
            for (size_t j = 0; j < nn.activations[l].cols; j++)
            {
                float dC_dA = MAT_AT(grad.activations[l], 0, j);
                MAT_AT(grad.biases[l - 1], 0, j) += dC_dA;
                // Iterate over the previous layer neurons
                for (size_t k = 0; k < nn.activations[l - 1].cols; k++)
                {
                    /* code */
                }
            }
        }
    }
    /* [...] Average gradients over all samples [...] */
}
Now we can compute the derivatives for the weights and for the previous layer’s activations:
\frac{\partial C_0}{\partial w_{jk}^{(L)}} = 2(a_j^{(L)} - y_j) \cdot \sigma(z_j^{(L)}) (1 - \sigma(z_j^{(L)})) \cdot a_k^{(L-1)}

and
\frac{\partial C_0}{\partial a_{k}^{(L-1)}} = \sum_{j=0}^{n_L - 1} 2(a_j^{(L)} - y_j) \cdot \sigma(z_j^{(L)}) (1 - \sigma(z_j^{(L)})) \cdot w_{jk}^{(L)}

Translated into code, this becomes the following. Note that, just as we did for the output layer, the code folds the previous layer’s sigmoid derivative a_k^{(L-1)}(1 - a_k^{(L-1)}) into grad.activations[l - 1], so that it already stores the reusable product for the next step backwards:
void nn_backpropagation(NeuralNetwork nn, NeuralNetwork grad, Matrix ti, Matrix to)
{
    nn_fill(grad, 0.0f);
    size_t num_samples = ti.rows;
    // Iterate over the samples
    for (size_t i = 0; i < num_samples; i++)
    {
        mat_cpy(NN_INPUT(nn), mat_row(ti, i));
        nn_forward(nn);
        for (size_t j = 0; j < grad.num_layers; j++)
        {
            mat_fill(grad.activations[j], 0.0f);
        }
        for (size_t j = 0; j < to.cols; j++)
        {
            MAT_AT(NN_OUTPUT(grad), 0, j) = 2 * (MAT_AT(NN_OUTPUT(nn), 0, j) - MAT_AT(to, i, j)) *
                                            MAT_AT(NN_OUTPUT(nn), 0, j) * (1 - MAT_AT(NN_OUTPUT(nn), 0, j));
        }
        // Iterate over the layers
        for (size_t l = nn.num_layers; l > 0; l--)
        {
            // Iterate over the neurons
            for (size_t j = 0; j < nn.activations[l].cols; j++)
            {
                float dC_dA = MAT_AT(grad.activations[l], 0, j);
                MAT_AT(grad.biases[l - 1], 0, j) += dC_dA;
                // Iterate over the previous layer neurons
                for (size_t k = 0; k < nn.activations[l - 1].cols; k++)
                {
                    MAT_AT(grad.weights[l - 1], k, j) += dC_dA * MAT_AT(nn.activations[l - 1], 0, k);
                    MAT_AT(grad.activations[l - 1], 0, k) += dC_dA * MAT_AT(nn.weights[l - 1], k, j) *
                                                             MAT_AT(nn.activations[l - 1], 0, k) * (1 - MAT_AT(nn.activations[l - 1], 0, k));
                }
            }
        }
    }
    /* [...] Average gradients over all samples [...] */
}
Note that it is sufficient to write grad.activations[l - 1] to refer to the previous layer’s activations, and not l - 2, because the activations array has one element more than the weights (the input).
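To make the offset concrete, here is a tiny sanity check of mine (a sketch, assuming the nn_alloc and ARRAY_LEN helpers from the previous episodes):

#define NN_IMPLEMENTATION
#include "nn.h"
#include <assert.h>

int main(void)
{
    size_t archi[] = {2, 2, 1};
    NeuralNetwork nn = nn_alloc(archi, ARRAY_LEN(archi) - 1);
    // num_layers == 2, but the activations array also stores the input,
    // so there are num_layers + 1 activation matrices:
    //   activations[0] (input), activations[1], activations[2] (output)
    // while weights and biases only go from 0 to 1. Hence, at layer l,
    // the parameters are weights[l - 1] and biases[l - 1], and the
    // previous activations are simply activations[l - 1].
    assert(nn.num_layers == 2);
    return 0;
}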
And that’s it! Here’s the final version:
void nn_backpropagation(NeuralNetwork nn, NeuralNetwork grad, Matrix ti, Matrix to)
{
    nn_fill(grad, 0.0f);
    size_t num_samples = ti.rows;
    for (size_t i = 0; i < num_samples; i++)
    {
        // Forward
        mat_cpy(NN_INPUT(nn), mat_row(ti, i));
        nn_forward(nn);
        for (size_t j = 0; j < grad.num_layers; j++)
        {
            mat_fill(grad.activations[j], 0.0f);
        }
        for (size_t j = 0; j < to.cols; j++)
        {
            MAT_AT(NN_OUTPUT(grad), 0, j) = 2 * (MAT_AT(NN_OUTPUT(nn), 0, j) - MAT_AT(to, i, j)) *
                                            MAT_AT(NN_OUTPUT(nn), 0, j) * (1 - MAT_AT(NN_OUTPUT(nn), 0, j));
        }
        // Backward pass
        for (size_t l = nn.num_layers; l > 0; l--)
        {
            for (size_t j = 0; j < nn.activations[l].cols; j++)
            {
                float dC_dA = MAT_AT(grad.activations[l], 0, j);
                MAT_AT(grad.biases[l - 1], 0, j) += dC_dA;
                for (size_t k = 0; k < nn.activations[l - 1].cols; k++)
                {
                    MAT_AT(grad.weights[l - 1], k, j) += dC_dA * MAT_AT(nn.activations[l - 1], 0, k);
                    MAT_AT(grad.activations[l - 1], 0, k) += dC_dA * MAT_AT(nn.weights[l - 1], k, j) *
                                                             MAT_AT(nn.activations[l - 1], 0, k) * (1 - MAT_AT(nn.activations[l - 1], 0, k));
                }
            }
        }
    }
    for (size_t i = 0; i < grad.num_layers; i++)
    {
        for (size_t j = 0; j < grad.weights[i].rows; j++)
        {
            for (size_t k = 0; k < grad.weights[i].cols; k++)
            {
                MAT_AT(grad.weights[i], j, k) /= num_samples;
            }
        }
        for (size_t j = 0; j < grad.biases[i].rows; j++)
        {
            for (size_t k = 0; k < grad.biases[i].cols; k++)
            {
                MAT_AT(grad.biases[i], j, k) /= num_samples;
            }
        }
    }
}
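By the way, since the finite difference implementation from the last episode is still around, a quick way to validate nn_backpropagation is to compute both gradients on the same network and print them side by side. A rough sketch of mine, meant to be dropped into the main of the test program below, after nn and the training matrices are set up:

// Compare backpropagation against the finite difference approximation.
NeuralNetwork grad_bp = nn_alloc(nn.archi, nn.num_layers);
NeuralNetwork grad_fd = nn_alloc(nn.archi, nn.num_layers);
nn_backpropagation(nn, grad_bp, train_in, train_out);
nn_finite_diff(nn, grad_fd, 1e-3, train_in, train_out);
for (size_t i = 0; i < nn.num_layers; i++)
    for (size_t j = 0; j < nn.weights[i].rows; j++)
        for (size_t k = 0; k < nn.weights[i].cols; k++)
            printf("bp: %f, fd: %f\n",
                   MAT_AT(grad_bp.weights[i], j, k),
                   MAT_AT(grad_fd.weights[i], j, k));

The two columns should match to a few decimal places; if they don’t, the indices in the inner loops are the first thing to check.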
We can now update the gradient descent function accordingly, commenting out the finite difference call and removing the epsilon parameter. We don’t need it anymore, as we are calculating the actual derivative and not an approximation!
void nn_gradient_descent(NeuralNetwork nn, float rate, Matrix train_in, Matrix train_out, size_t iterations)
{
    NeuralNetwork grad = nn_alloc(nn.archi, nn.num_layers);
    for (size_t it = 0; it < iterations; it++)
    {
        //nn_finite_diff(nn, grad, 1e-1, train_in, train_out);
        nn_backpropagation(nn, grad, train_in, train_out);
        for (size_t i = 0; i < nn.num_layers; i++)
        {
            for (size_t j = 0; j < nn.weights[i].rows; j++)
            {
                for (size_t k = 0; k < nn.weights[i].cols; k++)
                {
                    MAT_AT(nn.weights[i], j, k) -= rate * MAT_AT(grad.weights[i], j, k);
                }
            }
            for (size_t j = 0; j < nn.biases[i].rows; j++)
            {
                for (size_t k = 0; k < nn.biases[i].cols; k++)
                {
                    MAT_AT(nn.biases[i], j, k) -= rate * MAT_AT(grad.biases[i], j, k);
                }
            }
        }
    }
}
Results
Let’s use exactly the same program we used in the last episode to test the performance. We just need to make sure that we change the call to the gradient descent function by removing the epsilon:
#define NN_IMPLEMENTATION
#include "nn.h"
#include "time.h"

int main(void)
{
    srand(69);
    size_t archi[] = {2, 2, 1};
    NeuralNetwork nn = nn_alloc(archi, ARRAY_LEN(archi) - 1);
    nn_rand(nn, 0.0f, 1.0f);
    float train_set[] = {
        0, 0, 0,
        0, 1, 1,
        1, 0, 1,
        1, 1, 0
    };
    Matrix train_in = {
        .rows = 4,
        .cols = 2,
        .stride = 3,
        .data = train_set
    };
    Matrix train_out = {
        .rows = 4,
        .cols = 1,
        .stride = 3,
        .data = train_set + 2
    };
    printf("MSE BEFORE: %f\n", nn_mse(nn, train_in, train_out));
    nn_gradient_descent(nn, 10, train_in, train_out, 10*1000);
    printf("MSE AFTER: %f\n", nn_mse(nn, train_in, train_out));
    for (size_t i = 0; i < 4; i++)
    {
        mat_cpy(NN_INPUT(nn), mat_row(train_in, i));
        nn_forward(nn);
        printf("%f - %f: %f\n", MAT_AT(NN_INPUT(nn), 0, 0), MAT_AT(NN_INPUT(nn), 0, 1), MAT_AT(NN_OUTPUT(nn), 0, 0));
    }
    return 0;
}
And the output is…
MSE BEFORE: 0.356169
MSE AFTER: 0.000025
0.000000 - 0.000000: 0.004951
0.000000 - 1.000000: 0.995321
1.000000 - 0.000000: 0.994320
1.000000 - 1.000000: 0.004433
It works! And it’s much more efficient too! Look how long it takes to do a million iterations of gradient descent using the finite difference method:
MSE BEFORE: 0.356169
MSE AFTER: 0.000000
0.000000 - 0.000000: 0.000663
0.000000 - 1.000000: 0.999338
1.000000 - 0.000000: 0.999338
1.000000 - 1.000000: 0.000672
./nn 8.08s user 0.05s system 98% cpu 8.216 total
More than 8 seconds! Now look at gradient descent with backpropagation:
MSE BEFORE: 0.356169
MSE AFTER: 0.000000
0.000000 - 0.000000: 0.000482
0.000000 - 1.000000: 0.999550
1.000000 - 0.000000: 0.999432
1.000000 - 1.000000: 0.000438
./nn 1.67s user 0.02s system 98% cpu 1.719 total
Less than two seconds: almost a five-fold speedup!
Conclusions
The framework is ready. We have successfully created a framework for modelling arbitrarily complex neural networks in C. The next episodes will be a real application of the framework to the MNIST dataset, and we will try to get the network to recognize what we write!