
Welcome to the third episode of this series. This time we will build a more complex model and train it on a harder data set: we will extend the single neuron we built in the last episode into a small neural network with 3 neurons, and make it learn the XOR logic function.
Here’s the source code used in this series.
Architecture Definition
We have seen that a single neuron is not enough to model the XOR gate, so we know we need more neurons. But how many? How should we organise them? How many layers do we need? Neural network architecture is a whole branch of research in itself, and fully covering it is beyond the scope of this article, but we can still pick an architecture that fits this problem.

On the left we have the XOR electronic symbol, while on the right we have an equivalent circuit built from simpler logic gates, namely an OR, a NAND and an AND.
Since we know that the model we used in the last episode works with OR, NAND and AND, we can use the same model in a more complex network, like the one below:

Model Implementation and Training Dataset
Now that we have defined it, let’s put it down into code:
typedef struct {
    float w1;
    float w2;
    float b;
} Neuron;

typedef struct {
    Neuron n1;
    Neuron n2;
    Neuron n3;
} Network;

typedef float train[3];

train training_set[] = {
    {0, 0, 0},
    {0, 1, 1},
    {1, 0, 1},
    {1, 1, 0}
};

#define TRAIN_SIZE (sizeof(training_set) / sizeof(train))
We reused the per-neuron data structure from the last episode and grouped three of them into a network to keep things tidy. The number of parameters of the network is simply the sum of the parameters of all its neurons, i.e. 3 × 3 = 9.
Forward Propagation, Mean Squared Error and Gradient Descent
We need to modify some of these functions to suit the new architecture. Let’s start with forward propagation, and write it up according to the architecture itself:
float forward(Network n, float x1, float x2)
{
    float y1 = sigf(n.n1.w1 * x1 + n.n1.w2 * x2 + n.n1.b);
    float y2 = sigf(n.n2.w1 * x1 + n.n2.w2 * x2 + n.n2.b);
    float y3 = sigf(n.n3.w1 * y1 + n.n3.w2 * y2 + n.n3.b);
    return y3;
}
Here we used the same activation function as in the last episode, the sigmoid.
The MSE function varies a little as we calculate it over the whole network, taking into account the two inputs and the final output:
float mse(Network n)
{
    float distances = 0.f;
    for (size_t i = 0; i < TRAIN_SIZE; i++)
    {
        float x1 = training_set[i][0];
        float x2 = training_set[i][1];
        float y = training_set[i][2];
        float y_ = forward(n, x1, x2);
        distances += (y - y_) * (y - y_);
    }
    return distances / TRAIN_SIZE;
}
As you can see, we pass the network itself as input to the MSE function.
The gradient descent is the trickiest part, because it requires the partial derivative of the cost with respect to every parameter of the network. Approximating each one numerically, as we do below, quickly becomes infeasible as the network grows, and in a later episode we will introduce something called backpropagation, which computes all the derivatives efficiently. But for now, let's keep things simple:
Network gradient_descent(Network n, size_t iterations)
{
    float eps = 1e-3f;
    float rate = 5e-1f;
    for (size_t i = 0; i < iterations; i++)
    {
        float dw11 = (mse((Network){(Neuron){n.n1.w1 + eps, n.n1.w2, n.n1.b}, n.n2, n.n3}) - mse(n)) / eps;
        float dw12 = (mse((Network){(Neuron){n.n1.w1, n.n1.w2 + eps, n.n1.b}, n.n2, n.n3}) - mse(n)) / eps;
        float db1 = (mse((Network){(Neuron){n.n1.w1, n.n1.w2, n.n1.b + eps}, n.n2, n.n3}) - mse(n)) / eps;
        float dw21 = (mse((Network){n.n1, (Neuron){n.n2.w1 + eps, n.n2.w2, n.n2.b}, n.n3}) - mse(n)) / eps;
        float dw22 = (mse((Network){n.n1, (Neuron){n.n2.w1, n.n2.w2 + eps, n.n2.b}, n.n3}) - mse(n)) / eps;
        float db2 = (mse((Network){n.n1, (Neuron){n.n2.w1, n.n2.w2, n.n2.b + eps}, n.n3}) - mse(n)) / eps;
        float dw31 = (mse((Network){n.n1, n.n2, (Neuron){n.n3.w1 + eps, n.n3.w2, n.n3.b}}) - mse(n)) / eps;
        float dw32 = (mse((Network){n.n1, n.n2, (Neuron){n.n3.w1, n.n3.w2 + eps, n.n3.b}}) - mse(n)) / eps;
        float db3 = (mse((Network){n.n1, n.n2, (Neuron){n.n3.w1, n.n3.w2, n.n3.b + eps}}) - mse(n)) / eps;
        n.n1.w1 -= rate * dw11;
        n.n1.w2 -= rate * dw12;
        n.n1.b -= rate * db1;
        n.n2.w1 -= rate * dw21;
        n.n2.w2 -= rate * dw22;
        n.n2.b -= rate * db2;
        n.n3.w1 -= rate * dw31;
        n.n3.w2 -= rate * dw32;
        n.n3.b -= rate * db3;
    }
    return n;
}
I know it looks terrible, but we will fix it in the next episode.
Test
Let’s add some code to print the model details and the program’s entry point:
void print_network(Network n)
{
    printf("n1: w1 = %f, w2 = %f, b = %f\n", n.n1.w1, n.n1.w2, n.n1.b);
    printf("n2: w1 = %f, w2 = %f, b = %f\n", n.n2.w1, n.n2.w2, n.n2.b);
    printf("n3: w1 = %f, w2 = %f, b = %f\n", n.n3.w1, n.n3.w2, n.n3.b);
    printf("MSE: %f\n", mse(n));
    for (size_t i = 0; i < TRAIN_SIZE; i++)
    {
        float x1 = training_set[i][0];
        float x2 = training_set[i][1];
        float y = training_set[i][2];
        float y_ = forward(n, x1, x2);
        printf("%f ^ %f = %f (%f)\n", x1, x2, y_, y);
    }
}
int main(void)
{
    srand(42);
    Network n = {
        {(float)rand() / RAND_MAX,
         (float)rand() / RAND_MAX,
         (float)rand() / RAND_MAX},
        {(float)rand() / RAND_MAX,
         (float)rand() / RAND_MAX,
         (float)rand() / RAND_MAX},
        {(float)rand() / RAND_MAX,
         (float)rand() / RAND_MAX,
         (float)rand() / RAND_MAX}
    };
    printf("=== Before Gradient Descent ===\n");
    print_network(n);
    n = gradient_descent(n, 100 * 1000);
    printf("\n==== After Gradient Descent ===\n");
    print_network(n);
    return 0;
}
After running it, we get:
=== Before Gradient Descent ===
n1: w1 = 0.000329, w2 = 0.524587, b = 0.735424
n2: w1 = 0.263306, w2 = 0.376224, b = 0.196286
n3: w1 = 0.975874, w2 = 0.512318, b = 0.530449
MSE: 0.356283
0.000000 ^ 0.000000 = 0.813263 (0.000000)
0.000000 ^ 1.000000 = 0.834551 (1.000000)
1.000000 ^ 0.000000 = 0.818202 (1.000000)
1.000000 ^ 1.000000 = 0.838637 (0.000000)

==== After Gradient Descent ===
n1: w1 = 6.941672, w2 = 6.944701, b = -3.147482
n2: w1 = 5.141287, w2 = 5.142057, b = -7.882954
n3: w1 = 11.224638, w2 = -11.900411, b = -5.274389
MSE: 0.000051
0.000000 ^ 0.000000 = 0.008030 (0.000000)
0.000000 ^ 1.000000 = 0.993192 (1.000000)
1.000000 ^ 0.000000 = 0.993190 (1.000000)
1.000000 ^ 1.000000 = 0.006957 (0.000000)
…and it works! We found an architecture that can learn the XOR logic gate.
NOTE – We have implemented an architecture based on an alternative representation of the XOR gate. But if you look at the individual weights, you'll see that they are very different from the weights we had in the last episode. This is because the network is free to converge to any set of parameters that minimises the cost, not necessarily the one that mirrors the OR/NAND/AND decomposition we started from.
In the next episode, we will implement a neural network framework from scratch and try to port the models we have defined so far into it.