
Welcome to the fourth episode of this series. In this episode, we will build a framework that allows us to implement and train arbitrarily complex neural networks. We will start by looking at some linear algebra, then cover a few C library conventions, and finally build our framework on top of this knowledge.
Linear Algebra Fundamentals
Let’s take a look at the model we used in the last episode and analyse the mathematics behind it. I promise we need to.

Let’s focus on the first part of the forward propagation, i.e. getting the outputs of the first two neurons in the first layer, and express it mathematically:
\begin{aligned} y_1 &= x_1 \cdot w_{11} + x_2 \cdot w_{12} + b_1 \\ y_2 &= x_1 \cdot w_{21} + x_2 \cdot w_{22} + b_2 \end{aligned}

Do you notice any patterns? Well, if you have studied a bit of linear algebra, you may have noticed that this is just another way of writing matrix operations:
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix} \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}

This is only for the first layer; the results are passed on to the next layer, where the same process takes place:
\begin{bmatrix} y \end{bmatrix} = \begin{bmatrix} w_{31} & w_{32} \end{bmatrix} \cdot \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} + \begin{bmatrix} b_3 \end{bmatrix}

As you can see, by treating each forward pass as a matrix operation, we can compute arbitrarily complex networks generically rather than hard-coding the forwarding function. This also makes things scalable and efficient, as today’s systems are very good at matrix operations.
However, remember that when we multiply matrices, we always have to ensure that the number of columns in the first matrix is the same as the number of rows in the second matrix. If we want to build fully connected neural networks, this works perfectly!
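To make the shape requirement concrete (this notation is mine, not from the earlier episodes): a fully connected layer with n inputs and m neurons has an m \times n weight matrix, so the product is always well defined:

W_{m \times n} \cdot x_{n \times 1} + b_{m \times 1} = y_{m \times 1}

The m outputs then become the inputs of the next layer, so consecutive layers chain naturally as long as their sizes agree.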
Let’s have a quick look at matrix operations again; it will help us later. The dot product of two matrices A and B, denoted A \cdot B, is calculated by taking, for each entry, the sum of the products of a row of A with a column of B. If A is an m \times n matrix and B is an n \times p matrix, then the dot product A \cdot B gives an m \times p matrix.
(A\cdot B)_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}

where A_{ik} represents the element in the i^{th} row and k^{th} column of matrix A, and B_{kj} represents the element in the k^{th} row and j^{th} column of matrix B.
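As a quick sanity check (the numbers here are mine), multiplying a 2 \times 3 matrix by a 3 \times 2 matrix yields a 2 \times 2 matrix:

\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \cdot \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 4 & 5 \\ 10 & 11 \end{bmatrix}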
The sum of two matrices A and B, denoted as A + B, is calculated by adding corresponding elements in the matrices. If A and B are both m \times n matrices, then the sum A + B results in an m \times n matrix.
(A + B)_{ij} = A_{ij} + B_{ij}

where A_{ij} and B_{ij} represent the elements in the i^{th} row and j^{th} column of matrices A and B respectively.
Now that we have briefly reviewed the basics of linear algebra that we need, we can start implementing the neural network framework.
Single-Header Framework Setup
The framework implementation will follow the guidelines for stb-like C/C++ libraries. Such libraries should be single file and public domain. The main reason is to keep things simple and easy to follow/implement.
So let’s start by creating the header file nn.h and writing the first few lines of code:
#ifndef NN_H
#define NN_H
// Declarations
#endif // NN_H

#ifdef NN_IMPLEMENTATION
// Definitions
#endif // NN_IMPLEMENTATION
The cool thing about stb libraries is that they can act as both headers and implementations. But how do we use them? It’s very simple: we just write an nn.c file like this:
#define NN_IMPLEMENTATION
#include "nn.h"

int main(void) {
    return 0;
}
By defining the macro NN_IMPLEMENTATION, we make sure that all the definitions are included as well.
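If the project ever grows to several translation units, only one of them should define NN_IMPLEMENTATION; every other file just includes the header to get the declarations. A minimal sketch (the file name other.c is hypothetical):

// main.c: the single translation unit that compiles the definitions
#define NN_IMPLEMENTATION
#include "nn.h"

// other.c: only needs the declarations
#include "nn.h"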
Building the Framework
Now that we know what kind of algebraic operations we need to code, and where to code them, it’s a good time to start developing the framework itself! We will start by defining the matrix data structure, which will represent the matrices we saw earlier, and then the operations we need.
Matrix Definition
Since we want to be able to deal with arbitrarily large matrices, we need to define the matrix in a slightly different way than usual. Moreover, we can start adding some utility functions that we will certainly need later.
#ifndef NN_H
#define NN_H

#include <stddef.h>

typedef struct {
    size_t rows;
    size_t cols;
    float *data;
} Matrix;

Matrix mat_alloc(size_t rows, size_t cols);
void mat_print(Matrix m);
float rand_float(void);
void mat_rand(Matrix m, float min, float max);
void mat_fill(Matrix m, float val);

#endif // NN_H
We store the sizes and a pointer to the memory where the data lives. So, for example, we can do something like:
float data[] = {1, 2, 3, 4, 5, 6};
Matrix m = {2, 3, data};
Moreover, we declared a specific function called mat_alloc, whose purpose is to safely allocate and initialise the matrix data structure. The implementation is the following:
Matrix mat_alloc(size_t rows, size_t cols)
{
    // needs <stdlib.h> for malloc and <assert.h> for assert
    Matrix m;
    m.rows = rows;
    m.cols = cols;
    m.data = malloc(rows * cols * sizeof(*m.data));
    assert(m.data != NULL);
    return m;
}
Ensuring that we have actually reserved enough space for the matrix data (with assert) is particularly useful when dealing with large amounts of data, which we will eventually do.
Another useful function is mat_print, which we’ll use to look into the matrices as we try to understand them. The implementation is the following:
void mat_print(Matrix m)
{
    for (size_t i = 0; i < m.rows; i++)
    {
        for (size_t j = 0; j < m.cols; j++)
        {
            printf("%f ", m.data[i * m.cols + j]);
        }
        printf("\n");
    }
}
Now, as you can see, the element access is quite hard to read: since we are dealing with linear memory, we need to do some maths to reach the right element. A simple workaround is to use a macro:
#define MAT_AT(m, i, j) (m).data[(i) * (m).cols + (j)]
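For example, the macro works both for reading and for writing, since it expands to an lvalue:

float v = MAT_AT(m, 1, 2);  // reads m.data[1 * m.cols + 2]
MAT_AT(m, 1, 2) = v + 1.0f; // writes to the same element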
Therefore, the mat_print function is modified as follows:
void mat_print(Matrix m)
{
    for (size_t i = 0; i < m.rows; i++)
    {
        for (size_t j = 0; j < m.cols; j++)
        {
            printf("%f ", MAT_AT(m, i, j));
        }
        printf("\n");
    }
}
Let’s try to print something with:
#define NN_IMPLEMENTATION
#include "nn.h"

int main(void) {
    Matrix m = mat_alloc(3, 3);
    mat_print(m);
    return 0;
}
On my machine the output happens to be all zeros, although keep in mind that malloc does not guarantee zero-initialised memory:
0.000000 0.000000 0.000000
0.000000 0.000000 0.000000
0.000000 0.000000 0.000000
Nice! But it doesn’t look very useful so far, so let’s write the function to randomly initialise the matrix:
float rand_float(void)
{
    return (float)rand() / RAND_MAX;
}

void mat_rand(Matrix m, float min, float max)
{
    for (size_t i = 0; i < m.rows; i++)
    {
        for (size_t j = 0; j < m.cols; j++)
        {
            MAT_AT(m, i, j) = rand_float() * (max - min) + min;
        }
    }
}
It is sometimes useful to control the boundaries between which the values are generated. For example, if we run it now with the range 0 to 5 we get:
0.000039 0.657689 3.778027
2.293251 2.663836 1.094796
0.235223 3.394324 3.396482
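For reference, the test producing this output looks something like the following (the seeding call is my addition; without it, rand produces the same sequence on every run):

#define NN_IMPLEMENTATION
#include "nn.h"
#include <stdlib.h>
#include <time.h>

int main(void) {
    srand(time(NULL));       // seed the generator
    Matrix m = mat_alloc(3, 3);
    mat_rand(m, 0.0f, 5.0f); // values in [0, 5]
    mat_print(m);
    return 0;
}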
Looks like it’s working! Let’s code another useful function that we will use later to test our algebraic functions. Because we will need to test operations using some fixed values, it’s a good idea to implement the mat_fill function:
void mat_fill(Matrix m, float val)
{
    for (size_t i = 0; i < m.rows; i++)
    {
        for (size_t j = 0; j < m.cols; j++)
        {
            MAT_AT(m, i, j) = val;
        }
    }
}
Easy, right?
Let’s move on with the algebraic functions!
Matrix Operations
Now that we have defined the Matrix data structure, we can move on to the algebraic operations we can perform on it. Let’s add the declarations:
#ifndef NN_H
#define NN_H

#include <stddef.h>

typedef struct {
    size_t rows;
    size_t cols;
    float *data;
} Matrix;

Matrix mat_alloc(size_t rows, size_t cols);
void mat_print(Matrix m);
float rand_float(void);
void mat_rand(Matrix m, float min, float max);
void mat_fill(Matrix m, float val);
void mat_dot(Matrix dst, Matrix a, Matrix b);
void mat_sum(Matrix dst, Matrix a);

#endif // NN_H
Note that mat_dot and mat_sum do not return a matrix object, but compute their results into a caller-provided destination. This ensures that memory is safely managed up front, without the risk of memory mismanagement inside the function. We also follow the common practice of putting the destination parameter first, as memory-management functions do. Have a look at the declaration of memcpy:
void *memcpy(void dest[restrict .n], const void src[restrict .n], size_t n);
Let’s see how those are implemented:
void mat_sum(Matrix dst, Matrix a)
{
    assert(dst.rows == a.rows);
    assert(dst.cols == a.cols);
    for (size_t i = 0; i < dst.rows; i++)
    {
        for (size_t j = 0; j < dst.cols; j++)
        {
            MAT_AT(dst, i, j) += MAT_AT(a, i, j);
        }
    }
}
Note that we used assert as a precondition to avoid errors and unexpected behaviour with mismatched sizes. Let’s test it:
int main(void) {
    Matrix a = mat_alloc(3, 3);
    Matrix b = mat_alloc(3, 3);
    mat_fill(a, 1.0f);
    mat_fill(b, 1.0f);
    mat_sum(a, b);
    mat_print(a);
    return 0;
}
We are summing the elements of b into the elements of a, so the resulting matrix should be all 2s:
2.000000 2.000000 2.000000
2.000000 2.000000 2.000000
2.000000 2.000000 2.000000
Nice! Let’s move on with the matrix dot product!
As we were saying before, we must be sure that the sizes of the destination matrix and of the A and B matrices are right. After that, we can easily implement what we were defining before. In particular, for each element of the destination matrix:

(AB)_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}

Translated into C, it looks something like this:
void mat_dot(Matrix dst, Matrix a, Matrix b)
{
    assert(dst.rows == a.rows);
    assert(dst.cols == b.cols);
    assert(a.cols == b.rows);
    for (size_t i = 0; i < dst.rows; i++)
    {
        for (size_t j = 0; j < dst.cols; j++)
        {
            MAT_AT(dst, i, j) = 0.0f;
            for (size_t k = 0; k < a.cols; k++)
            {
                MAT_AT(dst, i, j) += MAT_AT(a, i, k) * MAT_AT(b, k, j);
            }
        }
    }
}
And let’s test it!
int main(void) {
    Matrix a = mat_alloc(2, 3);
    mat_fill(a, 1);
    mat_print(a);
    printf("\n");

    Matrix b = mat_alloc(3, 2);
    mat_fill(b, 2);
    mat_print(b);
    printf("\n");
    printf("---------------------\n");

    Matrix dst = mat_alloc(2, 2);
    mat_dot(dst, a, b);
    mat_print(dst);
    return 0;
}
And if we run it we get:
1.000000 1.000000 1.000000
1.000000 1.000000 1.000000
2.000000 2.000000
2.000000 2.000000
2.000000 2.000000
---------------------
6.000000 6.000000
6.000000 6.000000
LGTM!
The framework is almost finished! Training neural networks is, at its core, just matrix multiplication. Even when you use a fancy and powerful framework like NumPy or PyTorch, you are largely calling wrappers around native routines that implement exactly what we have just done! Of course, they are far more optimised and offer much more functionality, but the core is exactly this!
Let’s move on and try to port the XOR model we built in the last episode into this framework.
Porting the XOR Model
Let’s try to port the XOR model within this framework. To do so, we can simply think about all the matrices that we will need to perform all the computations. Let’s get to it:

Model Definition
So we need an input matrix x and, for each layer, the weights, the biases and the outputs of the activation functions. For ease of use, we will also have a final matrix for the outputs. To sum up, we will have a model like this:
typedef struct {
    Matrix x;
    Matrix w1;
    Matrix b1;
    Matrix a1;
    Matrix w2;
    Matrix b2;
    Matrix a2;
    Matrix y;
} Xor;
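For reference, with the row-vector convention used in the code below, the shapes will be:

// x  : 1x2  input
// w1 : 2x2  first-layer weights
// b1 : 1x2  first-layer biases
// a1 : 1x2  first-layer activations
// w2 : 2x1  second-layer weights
// b2 : 1x1  second-layer bias
// a2 : 1x1  second-layer activation
// y  : 1x1  network output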
Since we have specific matrices to store the results of the activation functions, it might be a good idea to add another utility function to the framework that applies the sigmoid to a whole matrix:
// Declaration
void mat_sigf(Matrix a);

// Implementation
void mat_sigf(Matrix a)
{
    for (size_t i = 0; i < a.rows; i++)
    {
        for (size_t j = 0; j < a.cols; j++)
        {
            MAT_AT(a, i, j) = sigf(MAT_AT(a, i, j));
        }
    }
}
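Note that mat_sigf relies on the scalar sigf from the previous episode; assuming the same sigmoid as before, its definition would be:

#include <math.h> // expf

float sigf(float x)
{
    return 1.0f / (1.0f + expf(-x));
}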
And we can now initialise the model and print it. (Notice that I slightly changed the mat_print function to also accept a matrix name.)
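A possible version of the updated mat_print (my own sketch; the exact output formatting may differ slightly):

void mat_print(Matrix m, const char *name)
{
    printf("%s: [\n", name);
    for (size_t i = 0; i < m.rows; i++)
    {
        printf("    ");
        for (size_t j = 0; j < m.cols; j++)
        {
            printf("%f ", MAT_AT(m, i, j));
        }
        printf("\n");
    }
    printf("]\n");
}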
int main(void) {
    srand(time(NULL));
    Xor xor;
    xor.x = mat_alloc(1, 2);
    xor.w1 = mat_alloc(2, 2);
    xor.b1 = mat_alloc(1, 2);
    xor.a1 = mat_alloc(1, 2);
    xor.w2 = mat_alloc(2, 1);
    xor.b2 = mat_alloc(1, 1);
    xor.a2 = mat_alloc(1, 1);
    xor.y = mat_alloc(1, 1);
    mat_rand(xor.w1, 0.0f, 1.0f);
    mat_rand(xor.b1, 0.0f, 1.0f);
    mat_rand(xor.w2, 0.0f, 1.0f);
    mat_rand(xor.b2, 0.0f, 1.0f);
    mat_print(xor.w1, "w1");
    mat_print(xor.b1, "b1");
    mat_print(xor.w2, "w2");
    mat_print(xor.b2, "b2");
    return 0;
}
And the result is something like:
w1: [
    0.466318 0.411265
    0.125399 0.587169
]
b1: [
    0.552376 0.789951
]
w2: [
    0.713223
    0.146254
]
b2: [
    0.089136
]
Yes, I know the initialisation looks awful, but we will eventually fix everything and use arrays. The reason I don’t do it that way right away is that I want to keep the process intuitive and guide the reasoning through a bit of copy-paste, to make it clear why we will need some slightly more complex, but more elegant, data structures.
Forward Propagation
Ok, we have a model, but how are we supposed to forward the computation and get the output of the whole neural network? Well, now it’s just a matter of applying the functions we defined above. So let’s write the XOR forwarding function. Before that, it might be a good time to add another useful utility function: mat_cpy. This will allow us to copy the result of the last activation matrix to the network output:
// Declaration
void mat_cpy(Matrix dst, Matrix src);

// Implementation
void mat_cpy(Matrix dst, Matrix src)
{
    assert(dst.rows == src.rows);
    assert(dst.cols == src.cols);
    for (size_t i = 0; i < dst.rows; i++)
    {
        for (size_t j = 0; j < dst.cols; j++)
        {
            MAT_AT(dst, i, j) = MAT_AT(src, i, j);
        }
    }
}
It’s not something completely new, so I’ll leave it up to you to understand. Now we can write the forwarding function:
void forward(Xor xor) {
    mat_dot(xor.a1, xor.x, xor.w1);
    mat_sum(xor.a1, xor.b1);
    mat_sigf(xor.a1);
    mat_dot(xor.a2, xor.a1, xor.w2);
    mat_sum(xor.a2, xor.b2);
    mat_sigf(xor.a2);
    mat_cpy(xor.y, xor.a2);
}
I bet you can see a pattern here, and that’s where we’re going to use loops and arrays. But let’s stick with this for now.
First, we compute the dot product of the input matrix xor.x and the weight matrix xor.w1, storing the result in xor.a1. This is done using mat_dot(xor.a1, xor.x, xor.w1).
Next, we add the bias xor.b1 to xor.a1 with mat_sum(xor.a1, xor.b1). This step adjusts the computed values by incorporating the bias.
We then apply the sigmoid activation function to the updated xor.a1 by calling mat_sigf(xor.a1). This activation function introduces non-linearity to the model, which is crucial for learning complex patterns.
The process is repeated for the next layer: we compute the dot product of the activated values from the previous layer, xor.a1, and the weight matrix xor.w2, storing the result in xor.a2 using mat_dot(xor.a2, xor.a1, xor.w2).
We add the bias xor.b2 to xor.a2 with mat_sum(xor.a2, xor.b2), adjusting the values with the bias for this layer.
Once more, we apply the sigmoid activation function to the updated xor.a2 by calling mat_sigf(xor.a2). This step ensures that the output values are within the desired range, typically between 0 and 1.
Finally, we copy the values from xor.a2 to xor.y using mat_cpy(xor.y, xor.a2). This sets the network’s output to be the computed values in xor.a2.
Together, these steps make up the neural network’s forward pass, transforming the input through the layers to produce the final output.
Let’s see how it performs:
int main(void) {
    srand(time(NULL));
    Xor xor;
    xor.x = mat_alloc(1, 2);
    xor.w1 = mat_alloc(2, 2);
    xor.b1 = mat_alloc(1, 2);
    xor.a1 = mat_alloc(1, 2);
    xor.w2 = mat_alloc(2, 1);
    xor.b2 = mat_alloc(1, 1);
    xor.a2 = mat_alloc(1, 1);
    xor.y = mat_alloc(1, 1);
    mat_rand(xor.w1, 0.0f, 1.0f);
    mat_rand(xor.b1, 0.0f, 1.0f);
    mat_rand(xor.w2, 0.0f, 1.0f);
    mat_rand(xor.b2, 0.0f, 1.0f);
    for (size_t i = 0; i < 2; i++)
    {
        for (size_t j = 0; j < 2; j++)
        {
            MAT_AT(xor.x, 0, 0) = i;
            MAT_AT(xor.x, 0, 1) = j;
            forward(xor);
            printf("%zu ^ %zu = %f\n", i, j, MAT_AT(xor.y, 0, 0));
        }
    }
    return 0;
}
Notice how we are now setting the inputs directly into the model, letting the forward function take care of reading them. Moreover, since we know the output of the XOR is just a single value, we can access it directly with MAT_AT(xor.y, 0, 0). The output is, of course, really bad:
0 ^ 0 = 0.758373
0 ^ 1 = 0.758373
1 ^ 0 = 0.767486
1 ^ 1 = 0.767486
It’s time to find a good way to measure performance (again with the MSE). Let’s start by defining the MSE function.
Mean Squared Error
Before that, anyway, we will introduce some other utility functions. First of all, it will be convenient to have a function that returns a specific row of the training data as a matrix, so that we can easily iterate through the training set and copy one row at a time into the model’s input:
// Declaration
Matrix mat_row(Matrix m, size_t row);

// Implementation
Matrix mat_row(Matrix m, size_t row)
{
    return (Matrix){
        .rows = 1,
        .cols = m.cols,
        .data = &MAT_AT(m, row, 0)
    };
}
This is quite straightforward: we just return a 1-row matrix with the same number of columns as the source, whose data pointer starts at the exact address of the first element of the row we are extracting. Note that nothing is copied; the returned matrix is a view into the original data.
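A quick usage sketch (the variable names are mine): since the returned matrix is a view, writing through it modifies the original matrix as well:

Matrix r = mat_row(m, 0);  // 1 x m.cols view of row 0, no copy
MAT_AT(r, 0, 0) = 42.0f;   // writes into m's data too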
Another useful feature we can introduce is a way to get the training input and output matrices from the same training data. Since we are dealing with data that is organised linearly in memory, we can introduce a parameter called stride, which tells us how many elements to skip to get from one row to the next. This means that we can have two matrices pointing into the same data (perhaps starting at different positions), with different numbers of rows and columns, each using its stride value to access the correct elements. This is very handy, as it allows us to use the same data without copying it every time and taking up all the memory.
Let’s modify the Matrix data structure, the MAT_AT macro and the mat_alloc and mat_row functions to accommodate this new feature:
typedef struct {
    size_t rows;
    size_t cols;
    size_t stride;
    float *data;
} Matrix;

#define MAT_AT(m, i, j) (m).data[(i) * (m).stride + (j)]

Matrix mat_alloc(size_t rows, size_t cols)
{
    Matrix m;
    m.rows = rows;
    m.cols = cols;
    m.stride = cols;
    m.data = malloc(rows * cols * sizeof(*m.data));
    assert(m.data != NULL);
    return m;
}

Matrix mat_row(Matrix m, size_t row)
{
    return (Matrix){
        .rows = 1,
        .cols = m.cols,
        .stride = m.stride,
        .data = &MAT_AT(m, row, 0)
    };
}
Now we can move on and proceed with the implementation of the MSE function:
float mse(Xor m, Matrix train_in, Matrix train_out)
{
    assert(train_in.rows == train_out.rows);
    assert(train_in.cols == m.x.cols);
    assert(train_out.cols == m.y.cols);
    float result = 0.0f;
    for (size_t i = 0; i < train_in.rows; i++)
    {
        Matrix x = mat_row(train_in, i);
        Matrix y = mat_row(train_out, i);
        mat_cpy(m.x, x);
        forward(m);
        for (size_t j = 0; j < train_out.cols; j++)
        {
            float diff = MAT_AT(m.y, 0, j) - MAT_AT(y, 0, j);
            result += diff * diff;
        }
    }
    return result / train_in.rows;
}
Let’s try to understand it line by line. First of all, we pass as parameters the model we want to operate on, the training inputs and the expected training outputs. Notice that the implementation accounts for arbitrarily sized inputs and outputs. Keep in mind that each individual input and output is always a one-row matrix.
Next, we check that everything is sized correctly and initialise the result variable. From here, the process is the same as in the previous episodes: we sum all the squared differences between the predicted and the expected outputs and take the average. To do this, we iterate over the training inputs, copying one row at a time into the model’s input matrix x. After copying the input values, we feed them forward and read the result from the model’s output matrix y. Finally, we iterate over all the output elements (for the XOR model there is just 1) and accumulate the squared differences.
Let’s now see how the model performs. But first, let’s define the training set, and from it the training inputs and training outputs:
float train_set[] = {
    0, 0, 0,
    0, 1, 1,
    1, 0, 1,
    1, 1, 0
};
while in the main we will have:
int main(void) {
    Matrix train_in = {
        .cols = 2,
        .rows = 4,
        .stride = 3,
        .data = train_set
    };
    Matrix train_out = {
        .cols = 1,
        .rows = 4,
        .stride = 3,
        .data = train_set + 2
    };
    mat_print(train_in, "train_in");
    mat_print(train_out, "train_out");
    return 0;
}
Let’s try printing the matrices to test if everything works fine:
train_in: [
    0.000000 0.000000
    0.000000 1.000000
    1.000000 0.000000
    1.000000 1.000000
]
train_out: [
    0.000000
    1.000000
    1.000000
    0.000000
]
Nice! Let’s now compute the MSE. Notice that we are not applying any learning mechanism (gradient descent) yet and are just building everything step by step, so we are not expecting any good performance for the moment.
int main(void)
{
    Matrix train_in = {
        .cols = 2,
        .rows = 4,
        .stride = 3,
        .data = train_set
    };
    Matrix train_out = {
        .cols = 1,
        .rows = 4,
        .stride = 3,
        .data = train_set + 2
    };
    srand(time(NULL));
    Xor xor;
    xor.x = mat_alloc(1, 2);
    xor.w1 = mat_alloc(2, 2);
    xor.b1 = mat_alloc(1, 2);
    xor.a1 = mat_alloc(1, 2);
    xor.w2 = mat_alloc(2, 1);
    xor.b2 = mat_alloc(1, 1);
    xor.a2 = mat_alloc(1, 1);
    xor.y = mat_alloc(1, 1);
    mat_rand(xor.w1, 0.0f, 1.0f);
    mat_rand(xor.b1, 0.0f, 1.0f);
    mat_rand(xor.w2, 0.0f, 1.0f);
    mat_rand(xor.b2, 0.0f, 1.0f);
    for (size_t i = 0; i < 2; i++)
    {
        for (size_t j = 0; j < 2; j++)
        {
            MAT_AT(xor.x, 0, 0) = i;
            MAT_AT(xor.x, 0, 1) = j;
            forward(xor);
            printf("%zu ^ %zu = %f\n", i, j, MAT_AT(xor.y, 0, 0));
        }
    }
    printf("MSE: %f\n", mse(xor, train_in, train_out));
    return 0;
}
The output is:
0 ^ 0 = 0.831721
0 ^ 1 = 0.831721
1 ^ 0 = 0.847161
1 ^ 1 = 0.847161
MSE: 0.371730
As in the previous episodes, the last thing we need to add is a learning mechanism, i.e. gradient descent, to improve the MSE and get it to zero in an iterative process.
Gradient Descent
I would like to address the gradient descent process in a slightly more elegant way. What I mean is that until now we computed the (partial) derivatives in place, while it might be a good idea to compute all the partial derivatives in a separate function and return the gradient. The gradient of a function at a specific point is just the vector of all its partial derivatives at that point:

\nabla f(p)=\left[\begin{array}{c} \frac{\partial f}{\partial x_1}(p) \\ \vdots \\ \frac{\partial f}{\partial x_n}(p) \end{array}\right]

For us, the container of all the partial derivatives (computed, again, with the finite difference method that we saw in the previous episodes) will be an Xor model itself: we will use a second Xor instance just to store the partial derivatives, and then, in a separate function, apply all the computed values iteratively. But first, since we are going to allocate another model, let’s write a function just for that:
Xor xor_alloc(void)
{
    Xor m;
    m.x = mat_alloc(1, 2);
    m.w1 = mat_alloc(2, 2);
    m.b1 = mat_alloc(1, 2);
    m.a1 = mat_alloc(1, 2);
    m.w2 = mat_alloc(2, 1);
    m.b2 = mat_alloc(1, 1);
    m.a2 = mat_alloc(1, 1);
    m.y = mat_alloc(1, 1);
    return m;
}
Now we can go on with the finite_difference function that will compute all the partial derivatives for all the parameters of the model:
void finite_difference(Xor m, Xor grad, float eps, Matrix train_in, Matrix train_out)
{
    float loss = mse(m, train_in, train_out);
    float saved;
    // For every parameter: save it, nudge it by eps, store the resulting
    // change in the loss into grad, then restore the original value.
    for (size_t i = 0; i < m.w1.rows; i++)
    {
        for (size_t j = 0; j < m.w1.cols; j++)
        {
            saved = MAT_AT(m.w1, i, j);
            MAT_AT(m.w1, i, j) += eps;
            MAT_AT(grad.w1, i, j) = (mse(m, train_in, train_out) - loss) / eps;
            MAT_AT(m.w1, i, j) = saved;
        }
    }
    for (size_t i = 0; i < m.b1.rows; i++)
    {
        for (size_t j = 0; j < m.b1.cols; j++)
        {
            saved = MAT_AT(m.b1, i, j);
            MAT_AT(m.b1, i, j) += eps;
            MAT_AT(grad.b1, i, j) = (mse(m, train_in, train_out) - loss) / eps;
            MAT_AT(m.b1, i, j) = saved;
        }
    }
    for (size_t i = 0; i < m.w2.rows; i++)
    {
        for (size_t j = 0; j < m.w2.cols; j++)
        {
            saved = MAT_AT(m.w2, i, j);
            MAT_AT(m.w2, i, j) += eps;
            MAT_AT(grad.w2, i, j) = (mse(m, train_in, train_out) - loss) / eps;
            MAT_AT(m.w2, i, j) = saved;
        }
    }
    for (size_t i = 0; i < m.b2.rows; i++)
    {
        for (size_t j = 0; j < m.b2.cols; j++)
        {
            saved = MAT_AT(m.b2, i, j);
            MAT_AT(m.b2, i, j) += eps;
            MAT_AT(grad.b2, i, j) = (mse(m, train_in, train_out) - loss) / eps;
            MAT_AT(m.b2, i, j) = saved;
        }
    }
}
What we are doing here is iterating over the four parameter matrices composing our model and computing the finite difference for each parameter: we save its value, nudge it by eps, measure the change in the MSE, and restore the saved value.
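In formula form, each entry of the gradient is approximated with the forward finite difference we used in the previous episodes, where L is the MSE over the training set:

\frac{\partial L}{\partial w} \approx \frac{L(w + \varepsilon) - L(w)}{\varepsilon}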
Now, we can easily implement the gradient descent function:
void gradient_descent(Xor m, float rate, float eps, Matrix train_in, Matrix train_out, size_t iterations)
{
    Xor grad = xor_alloc();
    for (size_t i = 0; i < iterations; i++)
    {
        finite_difference(m, grad, eps, train_in, train_out);
        // step every parameter against its gradient entry
        for (size_t j = 0; j < m.w1.rows; j++)
        {
            for (size_t k = 0; k < m.w1.cols; k++)
            {
                MAT_AT(m.w1, j, k) -= rate * MAT_AT(grad.w1, j, k);
            }
        }
        for (size_t j = 0; j < m.b1.rows; j++)
        {
            for (size_t k = 0; k < m.b1.cols; k++)
            {
                MAT_AT(m.b1, j, k) -= rate * MAT_AT(grad.b1, j, k);
            }
        }
        for (size_t j = 0; j < m.w2.rows; j++)
        {
            for (size_t k = 0; k < m.w2.cols; k++)
            {
                MAT_AT(m.w2, j, k) -= rate * MAT_AT(grad.w2, j, k);
            }
        }
        for (size_t j = 0; j < m.b2.rows; j++)
        {
            for (size_t k = 0; k < m.b2.cols; k++)
            {
                MAT_AT(m.b2, j, k) -= rate * MAT_AT(grad.b2, j, k);
            }
        }
    }
}
I want to stress how horrible this looks. Imagine that we want to add another layer: we would have to hard-code two more pairs of for loops, both in the finite difference function and in the gradient descent. We can easily solve this by adding a loop, as we were saying before, and make it more elegant and scalable. But that’s something we will do in the next episode.
Results
Now, let’s modify the program’s entry point accordingly and see how it performs over 10,000 iterations:
int main(void)
{
    Matrix train_in = {
        .cols = 2,
        .rows = 4,
        .stride = 3,
        .data = train_set
    };
    Matrix train_out = {
        .cols = 1,
        .rows = 4,
        .stride = 3,
        .data = train_set + 2
    };
    srand(time(NULL));
    Xor xor = xor_alloc();
    // initialise the parameters randomly, as before (xor_alloc only allocates)
    mat_rand(xor.w1, 0.0f, 1.0f);
    mat_rand(xor.b1, 0.0f, 1.0f);
    mat_rand(xor.w2, 0.0f, 1.0f);
    mat_rand(xor.b2, 0.0f, 1.0f);
    printf("MSE BEFORE: %f\n", mse(xor, train_in, train_out));
    gradient_descent(xor, 5e-1, 1e-1, train_in, train_out, 10*1000);
    printf("MSE AFTER: %f\n", mse(xor, train_in, train_out));
    for (size_t i = 0; i < 2; i++)
    {
        for (size_t j = 0; j < 2; j++)
        {
            MAT_AT(xor.x, 0, 0) = i;
            MAT_AT(xor.x, 0, 1) = j;
            forward(xor);
            printf("%zu ^ %zu = %f\n", i, j, MAT_AT(xor.y, 0, 0));
        }
    }
    return 0;
}
With the output:
MSE BEFORE: 0.275863
MSE AFTER: 0.000722
0 ^ 0 = 0.027539
0 ^ 1 = 0.972974
1 ^ 0 = 0.972977
1 ^ 1 = 0.025872
It works! We have successfully ported the model that learns the XOR logic gate to our newly implemented framework.
In the next episode, we will make the whole framework more scalable by adding utilities that let us build arbitrarily complex networks without hard-coding the training functions.