
Welcome to the second episode of the series. In this post, we will implement a more complex model by adding one or more parameters to the one we created in the previous episode.
This episode will introduce the concepts of neurons, bias, and activation functions.
Here’s the source code used in this series.
Artificial Neuron
Since the beginning of this research field, the aim has been to model an artificial neuron after a biological one:

As you can see, on the left of the artificial neuron, we have a set of inputs (in the first episode, it was just one, x), a linear function in the middle (in the last episode, it was y = x * w), an activation function (which we will discuss later), and the output (the same as in the last episode, y).
Let’s start by defining it with a simple data structure:
typedef struct {
    float w1;
    float w2;
} Neuron;
Training Dataset
Now that we have 2 parameters, we can try to train it on a more complex task, such as modeling a logic gate like an OR. Below is the truth table:
| x_1 | x_2 | y |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 1 |
In C, we can code it like this:
typedef float train[3];
train training_set[] = {
    {0, 0, 0},
    {0, 1, 1},
    {1, 0, 1},
    {1, 1, 1}
};
#define TRAIN_SIZE (sizeof(training_set) / sizeof(train))
Model Initialization
The procedure is always the same, so:
#include <stdlib.h> /* srand, rand, RAND_MAX */

int main(void)
{
    srand(42);
    Neuron n;
    n.w1 = (float)rand() / RAND_MAX;
    n.w2 = (float)rand() / RAND_MAX;
    return 0;
}
Performance Evaluation and Model Training
The procedure remains the same, but now we need to address more parameters, still using Mean Squared Error (MSE) and Gradient Descent. The MSE formula is now slightly different:
float mse(Neuron n)
{
    float distances = 0.f;
    for (size_t i = 0; i < TRAIN_SIZE; i++)
    {
        float x1 = training_set[i][0];
        float x2 = training_set[i][1];
        float y = training_set[i][2];
        float y_ = x1 * n.w1 + x2 * n.w2;
        distances += (y - y_) * (y - y_);
    }
    return distances / TRAIN_SIZE;
}
As you can see, the only difference is that we increased the number of parameters, and therefore the number of multiplications we have to perform.
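Written out, the quantity computed by `mse` corresponds to the formula below. Here N stands for the number of training rows (TRAIN_SIZE in the code), and the row index i is notation introduced only for this sketch:

```latex
\mathrm{MSE}(w_1, w_2) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \left( x_{1,i}\, w_1 + x_{2,i}\, w_2 \right) \right)^2
```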
The same applies to gradient descent:
Neuron gradient_descent(Neuron n, size_t iterations)
{
    float eps = 1e-3;  /* small nudge used to approximate the derivatives */
    float rate = 1e-3; /* learning rate */
    for (size_t i = 0; i < iterations; i++)
    {
        /* Nudge one weight at a time and measure how the MSE changes. */
        float dw1 = (mse((Neuron){n.w1 + eps, n.w2}) - mse(n)) / eps;
        float dw2 = (mse((Neuron){n.w1, n.w2 + eps}) - mse(n)) / eps;
        n.w1 -= rate * dw1;
        n.w2 -= rate * dw2;
    }
    return n;
}
The only difference here is that we are now computing partial derivatives, meaning we are examining how the system performs when adjusting a single parameter at a time. First, we try increasing just w_1, then we try increasing just w_2, and we apply the modifications.
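To make the idea of a partial derivative more concrete, here is a minimal sketch that applies the same "nudge one parameter at a time" trick to a toy function whose derivatives are known in advance. The names `toy`, `partial_a`, and `partial_b` are hypothetical and not part of the series' source code.

```c
#include <assert.h>
#include <math.h>

/* Toy function with known partial derivatives:
   f(a, b) = a*a + 3*b, so df/da = 2*a and df/db = 3. */
float toy(float a, float b)
{
    return a * a + 3.0f * b;
}

/* Approximate df/da: nudge only a, keep b fixed. */
float partial_a(float a, float b, float eps)
{
    assert(eps > 0.0f);
    return (toy(a + eps, b) - toy(a, b)) / eps;
}

/* Approximate df/db: nudge only b, keep a fixed. */
float partial_b(float a, float b, float eps)
{
    assert(eps > 0.0f);
    return (toy(a, b + eps) - toy(a, b)) / eps;
}
```

At the point (2, 1) with eps = 1e-3, `partial_a` should land close to 4 and `partial_b` close to 3, up to the approximation and rounding error of the finite difference.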
Let’s try to run the entire process and see how it performs! But first, let’s write a function to print the model, as it has become slightly more complex, and then run it with one iteration.
void print_model(Neuron n)
{
    printf("W1: %f, W2: %f, MSE: %f\n", n.w1, n.w2, mse(n));
    for (size_t i = 0; i < TRAIN_SIZE; i++)
    {
        printf("%f * %f + %f * %f = %f (%f)\n", training_set[i][0],
               n.w1,
               training_set[i][1],
               n.w2,
               training_set[i][0] * n.w1 + training_set[i][1] * n.w2,
               training_set[i][2]);
    }
}
int main(void)
{
    srand(42);
    Neuron n;
    n.w1 = (float)rand() / RAND_MAX;
    n.w2 = (float)rand() / RAND_MAX;
    printf("=== Before Gradient Descent ===\n");
    print_model(n);
    n = gradient_descent(n, 1);
    printf("\n=== After Gradient Descent ===\n");
    print_model(n);
    return 0;
}
Here is the result:
=== Before Gradient Descent ===
W1: 0.000329, W2: 0.524587, MSE: 0.125224
0.000000 * 0.000329 + 0.000000 * 0.524587 = 0.000000 (0.000000)
0.000000 * 0.000329 + 1.000000 * 0.524587 = 0.524587 (0.000000)
1.000000 * 0.000329 + 0.000000 * 0.524587 = 0.000329 (0.000000)
1.000000 * 0.000329 + 1.000000 * 0.524587 = 0.524916 (1.000000)
=== After Gradient Descent ===
W1: 0.000566, W2: 0.524562, MSE: 0.125167
0.000000 * 0.000566 + 0.000000 * 0.524562 = 0.000000 (0.000000)
0.000000 * 0.000566 + 1.000000 * 0.524562 = 0.524562 (0.000000)
1.000000 * 0.000566 + 0.000000 * 0.524562 = 0.000566 (0.000000)
1.000000 * 0.000566 + 1.000000 * 0.524562 = 0.525127 (1.000000)
The MSE has slightly improved, so let’s try running it for 1000 iterations:
=== Before Gradient Descent ===
W1: 0.000329, W2: 0.524587, MSE: 0.362766
0.000000 * 0.000329 + 0.000000 * 0.524587 = 0.000000 (0.000000)
0.000000 * 0.000329 + 1.000000 * 0.524587 = 0.524587 (1.000000)
1.000000 * 0.000329 + 0.000000 * 0.524587 = 0.000329 (1.000000)
1.000000 * 0.000329 + 1.000000 * 0.524587 = 0.524916 (1.000000)
=== After Gradient Descent ===
W1: 0.417348, W2: 0.735286, MSE: 0.108214
0.000000 * 0.417348 + 0.000000 * 0.735286 = 0.000000 (0.000000)
0.000000 * 0.417348 + 1.000000 * 0.735286 = 0.735286 (1.000000)
1.000000 * 0.417348 + 0.000000 * 0.735286 = 0.417348 (1.000000)
1.000000 * 0.417348 + 1.000000 * 0.735286 = 1.152634 (1.000000)
We are improving, but not significantly. Let’s break it down.
Activation Functions
Even after optimizing the weights, the model’s output for the input (1, 1) overshoots the target, reaching 1.152634. This happens because the current setup is a simple linear combination of inputs and weights: its output range is unrestricted, so the neuron can produce values well beyond the range of the targets, which are all either 0 or 1. A purely linear model also struggles to capture more complex patterns in the data.
A useful method in machine learning and neural networks is to add activation functions.
Activation functions introduce non-linearity to the models by computing the output from the input in specific ways. There are many activation functions, each suited for specific purposes. One of the most famous is the Sigmoid function.

The output is bounded along the vertical axis, and the function is non-linear. It is expressed mathematically as:
\sigma(x)=\frac{1}{1+e^{-x}}
Let’s code it:
#include <math.h> /* expf */

float sigf(float x)
{
    return 1.0f / (1.0f + expf(-x));
}
Now we can use the activation function in the forward propagation process. Wait, the what?
Forward Propagation
Keep calm, it’s not something new; we’ve already used it, just without calling it by its proper name. Remember how we compute the MSE? Well, we’re essentially making predictions.
float y_ = x1 * n.w1 + x2 * n.w2;
This is it. The process of getting an output from the inputs is called forward propagation.
We can write a function for that so that if we need to modify it, we can do it just once:
float forward(Neuron n, float x1, float x2)
{
    return sigf(x1 * n.w1 + x2 * n.w2);
}
And as you can see, we modified the forward propagation with the sigmoid function so that we can achieve non-linearity and bounded outputs. Let’s accordingly modify the MSE function and the neuron printing:
float mse(Neuron n)
{
    float distances = 0.f;
    for (size_t i = 0; i < TRAIN_SIZE; i++)
    {
        float x1 = training_set[i][0];
        float x2 = training_set[i][1];
        float y = training_set[i][2];
        float y_ = forward(n, x1, x2);
        distances += (y - y_) * (y - y_);
    }
    return distances / TRAIN_SIZE;
}
void print_model(Neuron n)
{
    printf("W1: %f, W2: %f, MSE: %f\n", n.w1, n.w2, mse(n));
    for (size_t i = 0; i < TRAIN_SIZE; i++)
    {
        printf("%f * %f + %f * %f = %f (%f)\n", training_set[i][0],
               n.w1,
               training_set[i][1],
               n.w2,
               forward(n, training_set[i][0], training_set[i][1]),
               training_set[i][2]);
    }
}
Let’s see how it performs now with 100000 iterations:
=== Before Gradient Descent ===
W1: 0.000329, W2: 0.524587, MSE: 0.194075
0.000000 * 0.000329 + 0.000000 * 0.524587 = 0.500000 (0.000000)
0.000000 * 0.000329 + 1.000000 * 0.524587 = 0.628220 (1.000000)
1.000000 * 0.000329 + 0.000000 * 0.524587 = 0.500082 (1.000000)
1.000000 * 0.000329 + 1.000000 * 0.524587 = 0.628296 (1.000000)
=== After Gradient Descent ===
W1: 2.025726, W2: 2.104172, MSE: 0.068911
0.000000 * 2.025726 + 0.000000 * 2.104172 = 0.500000 (0.000000)
0.000000 * 2.025726 + 1.000000 * 2.104172 = 0.891308 (1.000000)
1.000000 * 2.025726 + 0.000000 * 2.104172 = 0.883472 (1.000000)
1.000000 * 2.025726 + 1.000000 * 2.104172 = 0.984170 (1.000000)
It seems there’s an issue with the first training element: for the input (0, 0) the output stays at 0.5 no matter how the weights change, because both products are zero and the sigmoid of 0 is always 0.5. It would be helpful to have a mechanism to shift the activation function, adding another degree of freedom that lets this output move towards 0. Any ideas?
Bias
The constant term that shifts the activation function is called the bias. It’s a parameter of the neuron and is simply added to the linear combination before the activation is applied, which slides the sigmoid curve horizontally along its input axis. Let’s modify the Neuron structure accordingly:
typedef struct {
    float w1;
    float w2;
    float b;
} Neuron;
But since it’s a parameter, just like the weights, we need to apply gradient descent on it as well:
Neuron gradient_descent(Neuron n, size_t iterations)
{
    float eps = 1e-3;
    float rate = 1e-3;
    for (size_t i = 0; i < iterations; i++)
    {
        float dw1 = (mse((Neuron){n.w1 + eps, n.w2, n.b}) - mse(n)) / eps;
        float dw2 = (mse((Neuron){n.w1, n.w2 + eps, n.b}) - mse(n)) / eps;
        float db = (mse((Neuron){n.w1, n.w2, n.b + eps}) - mse(n)) / eps;
        n.w1 -= rate * dw1;
        n.w2 -= rate * dw2;
        n.b -= rate * db;
    }
    return n;
}
OK, let’s see if anything changes, again with 100000 iterations:
=== Before Gradient Descent ===
W1: 0.000329, W2: 0.524587, MSE: 0.194075
0.000000 * 0.000329 + 0.000000 * 0.524587 = 0.500000 (0.000000)
0.000000 * 0.000329 + 1.000000 * 0.524587 = 0.628220 (1.000000)
1.000000 * 0.000329 + 0.000000 * 0.524587 = 0.500082 (1.000000)
1.000000 * 0.000329 + 1.000000 * 0.524587 = 0.628296 (1.000000)
=== After Gradient Descent ===
W1: 2.445868, W2: 2.491986, MSE: 0.036005
0.000000 * 2.445868 + 0.000000 * 2.491986 = 0.295156 (0.000000)
0.000000 * 2.445868 + 1.000000 * 2.491986 = 0.835004 (1.000000)
1.000000 * 2.445868 + 0.000000 * 2.491986 = 0.828551 (1.000000)
1.000000 * 2.445868 + 1.000000 * 2.491986 = 0.983166 (1.000000)
We are getting closer and closer. Let’s try some ML magic and tweak the learning rate, changing it from 1\times 10^{-3} to 5 \times 10^{-1}:
=== Before Gradient Descent ===
W1: 0.000329, W2: 0.524587, MSE: 0.194075
0.000000 * 0.000329 + 0.000000 * 0.524587 = 0.500000 (0.000000)
0.000000 * 0.000329 + 1.000000 * 0.524587 = 0.628220 (1.000000)
1.000000 * 0.000329 + 0.000000 * 0.524587 = 0.500082 (1.000000)
1.000000 * 0.000329 + 1.000000 * 0.524587 = 0.628296 (1.000000)
=== After Gradient Descent ===
W1: 9.616678, W2: 9.616679, MSE: 0.000047
0.000000 * 9.616678 + 0.000000 * 9.616679 = 0.010182 (0.000000)
0.000000 * 9.616678 + 1.000000 * 9.616679 = 0.993566 (1.000000)
1.000000 * 9.616678 + 0.000000 * 9.616679 = 0.993566 (1.000000)
1.000000 * 9.616678 + 1.000000 * 9.616679 = 1.000000 (1.000000)
Results
Et voilà! Our model is now able to understand logic gates! Notice how we never explicitly told the model what an OR gate is; we simply provided a few examples, and it learned it.
Let’s try making it learn the AND gate by changing the training set:
train training_set[] = {
    {0, 0, 0},
    {0, 1, 0},
    {1, 0, 0},
    {1, 1, 1}
};
Here are the results:
=== Before Gradient Descent ===
W1: 0.000329, W2: 0.524587, MSE: 0.258226
0.000000 * 0.000329 + 0.000000 * 0.524587 = 0.500000 (0.000000)
0.000000 * 0.000329 + 1.000000 * 0.524587 = 0.628220 (0.000000)
1.000000 * 0.000329 + 0.000000 * 0.524587 = 0.500082 (0.000000)
1.000000 * 0.000329 + 1.000000 * 0.524587 = 0.628296 (1.000000)
=== After Gradient Descent ===
W1: 8.975070, W2: 8.975070, MSE: 0.000089
0.000000 * 8.975070 + 0.000000 * 8.975070 = 0.000001 (0.000000)
0.000000 * 8.975070 + 1.000000 * 8.975070 = 0.010214 (0.000000)
1.000000 * 8.975070 + 0.000000 * 8.975070 = 0.010214 (0.000000)
1.000000 * 8.975070 + 1.000000 * 8.975070 = 0.987888 (1.000000)
This is amazing! Let’s try another one, perhaps the XOR gate, whose training set is:
train training_set[] = {
    {0, 0, 0},
    {0, 1, 1},
    {1, 0, 1},
    {1, 1, 0}
};
And the results are…
=== Before Gradient Descent ===
W1: 0.000329, W2: 0.524587, MSE: 0.258224
0.000000 * 0.000329 + 0.000000 * 0.524587 = 0.500000 (0.000000)
0.000000 * 0.000329 + 1.000000 * 0.524587 = 0.628220 (1.000000)
1.000000 * 0.000329 + 0.000000 * 0.524587 = 0.500082 (1.000000)
1.000000 * 0.000329 + 1.000000 * 0.524587 = 0.628296 (0.000000)
=== After Gradient Descent ===
W1: 0.000925, W2: 0.000975, MSE: 0.250000
0.000000 * 0.000925 + 0.000000 * 0.000975 = 0.499598 (0.000000)
0.000000 * 0.000925 + 1.000000 * 0.000975 = 0.499842 (1.000000)
1.000000 * 0.000925 + 0.000000 * 0.000975 = 0.499829 (1.000000)
1.000000 * 0.000925 + 1.000000 * 0.000975 = 0.500073 (0.000000)
…bad? Wait, what’s going on?
One Neuron is Not Enough
We’ve encountered a crucial problem in machine learning. It is the reason why successful models are composed of multiple neurons organized in layers, forming the so-called deep neural networks, which we will eventually explore.
With our current model, we can successfully train an AND or an OR gate, but not an XOR gate. Why is that? Let’s try to visualize, for example, the AND gate with 4 points in a 2-dimensional graph:

A model with just one neuron is suitable only for problems whose points can be split into two regions by a single straight line, that is, linearly separable problems. But what about the XOR gate? Let’s try to plot it:

It doesn’t matter where we try to draw the separation line; we will always have some misclassified points.
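The same observation can be checked by brute force. The sketch below tries every line w1 * x1 + w2 * x2 + b = 0 with small integer coefficients and counts how many of them classify all four points of a gate correctly, treating a positive value as output 1. The function `count_separating_lines`, the integer grid, and the `*_set` arrays are assumptions made for this illustration; they are not part of the series' code.

```c
#include <assert.h>
#include <stddef.h>

typedef int sample[3]; /* x1, x2, expected output */

sample and_set[4] = {{0, 0, 0}, {0, 1, 0}, {1, 0, 0}, {1, 1, 1}};
sample or_set[4]  = {{0, 0, 0}, {0, 1, 1}, {1, 0, 1}, {1, 1, 1}};
sample xor_set[4] = {{0, 0, 0}, {0, 1, 1}, {1, 0, 1}, {1, 1, 0}};

/* Counts the integer triples (w1, w2, b) in [-range, range] for which
   "w1*x1 + w2*x2 + b > 0" matches the expected output on all 4 samples. */
int count_separating_lines(sample set[4], int range)
{
    assert(range >= 0);
    int count = 0;
    for (int w1 = -range; w1 <= range; w1++)
        for (int w2 = -range; w2 <= range; w2++)
            for (int b = -range; b <= range; b++)
            {
                int ok = 1;
                for (size_t i = 0; i < 4; i++)
                {
                    int z = w1 * set[i][0] + w2 * set[i][1] + b;
                    if ((z > 0) != (set[i][2] == 1))
                        ok = 0;
                }
                count += ok;
            }
    return count;
}
```

For AND and OR the count is greater than zero (for example, w1 = 1, w2 = 1, b = -1 works for AND), while for XOR it is zero. This is not an artifact of the grid: XOR would require b <= 0, w1 + b > 0, and w2 + b > 0, which together force w1 + w2 + b > 0, contradicting the requirement that (1, 1) be classified as 0.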
In the next episode, we will attempt to implement a more complex model, creating our very first network capable of learning from more complex datasets.