Implementing a Neural Network in Pure C – Ep.1

In this series, I aim to illustrate my process of gaining a deeper understanding of neural networks. As usual, to truly comprehend such an intricate subject, I prefer to implement it in pure C. This first episode covers the fundamentals of model training and evaluation, while also demonstrating how these processes work at a lower level. Here’s the source code used in this series.

Single Input, Neuron and Output

To start, let’s develop a neural network with a single input, a single neuron, and a single output. We will delve into the concept of neurons in the upcoming episode; for now, let’s concentrate on finding the value w that gives us a model such that:

y = x * w

For instance, if x = 2, we want y = 4, implying that our initial model should have only one parameter, which must be w = 2.

Training Dataset

First of all, we must define a dataset that the model can use to learn this simple function.

C
typedef float train[2];

train training_set[] = {
	{0, 0},
	{1, 2},
	{2, 4},
	{3, 6},
	{4, 8}
};

#define TRAIN_SIZE (sizeof(training_set) / sizeof(train))

So, we defined an array of train elements, that is, an array of 2-element arrays, where the first element is the input (x) and the second is the expected output (y).

Model Definition (Variable Declaration)

Now that we have a training set, it’s a good time to introduce the machine learning model, which we are going to initialize randomly (kind of):

C
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  srand(42);
  
  // THE MODEL
  float w = (float)rand() / RAND_MAX;
  
  return 0;
}

That’s it. Literally just 1 (one) float. We are one parameter closer to the number of parameters of GPT-4o (~200 billion).

Performance Evaluation

First and foremost, we need a way to determine whether the performance of this model is good or not. To achieve this, we will measure the difference between the obtained result and the expected one for each training element, square each difference (so that positive and negative errors never cancel out, and larger errors weigh more), and then average the squared differences across the whole training set. This metric is commonly known as the Mean Squared Error (MSE). Let’s proceed by writing the function for it and printing the performance:

C
float mse(float w)
{
	float distances = 0.f;

	for (size_t i = 0; i < TRAIN_SIZE; i++)
	{
		float x = training_set[i][0];
		float y = training_set[i][1];
		float y_ = x * w;
		distances += (y - y_) * (y - y_);
	}

	return distances / TRAIN_SIZE;
}
	
int main(void)
{
	srand(42);

	float w = (float)rand() / RAND_MAX;

	printf("W: %f, MSE: %f\n", w, mse(w));

	return 0;
}

When executed, it returns:

W: 0.000329, MSE: 23.992111

Very disappointing. But we are going to fix it.

Model Training

We understand that the Mean Squared Error (MSE) is a function with one variable (w), and we aim for it to be as close to zero as possible, ideally zero itself. You can envision the MSE function as a valley, with the parameter (w) acting like a car. Our objective is to position it at the lowest point in the valley.

Left = decrease W, Right = increase W.

Essentially, we aim to adjust the parameter w (representing the car) in such a way that the value of the MSE decreases, effectively moving the car towards the bottom of the slope. However, determining the direction to move, whether left or right, depends on the slope’s direction. If the slope is descending, we need to move to the right; otherwise, we move to the left. We employ a powerful mathematical tool to understand the slope inclination: the derivative. I’ve adjusted the variable names to maintain coherence with the code we’re developing.

f'(w) = \lim_{\varepsilon \rightarrow 0}{\frac{f(w+\varepsilon)-f(w)}{\varepsilon}}

Unfortunately, computing the exact value of this limit in code isn’t feasible, but we can approximate it by plugging in a small ε. What matters most is the sign: the derivative is negative if the slope is descending and positive otherwise. Therefore, subtracting it from w will decrease w if the derivative is positive and increase it if negative. This iterative process continues until we are sufficiently satisfied with the value of the MSE. We’ve just described a simplified version of Gradient Descent. Let’s code it:

C
float gradient_descent(float w, size_t iterations)
{
	float eps = 1e-3;
	
	for (size_t i = 0; i < iterations; i++)
	{
		float dw = (mse(w + eps) - mse(w)) / eps;
		w -= dw;
	}

	return w;
}

int main(void)
{
	srand(42);

	float w = (float)rand() / RAND_MAX;
	printf("=== Before Gradient Descent ===\n");
	printf("W: %f, MSE: %f\n", w, mse(w));

	w = gradient_descent(w, 1);

	printf("===  After Gradient Descent ===\n");
	printf("W: %f, MSE: %f\n", w, mse(w));
	return 0;
}

The line float dw = (mse(w + eps) - mse(w)) / eps; is, as we were saying, an approximation of the derivative, commonly known as a finite difference. Let’s see how the performance is affected:

=== Before Gradient Descent ===
W: 0.000329, MSE: 23.992111
===  After Gradient Descent ===
W: 23.990957, MSE: 2901.613281

Ah, I see. That’s a common issue in machine learning. It occurs when we lose control over the gradient descent, causing it to diverge. Fortunately, we can address this by introducing a hyperparameter called the learning rate, which reduces the size of the “jump” taken during each iteration. Here’s the modified code incorporating this adjustment:

C
float gradient_descent(float w, size_t iterations)
{
	float eps = 1e-3;
	float rate = 1e-3;
	
	for (size_t i = 0; i < iterations; i++)
	{
		float dw = (mse(w + eps) - mse(w)) / eps;
		w -= rate * dw;
	}

	return w;
}

And here are the results:

=== Before Gradient Descent ===
W: 0.000329, MSE: 23.992111
===  After Gradient Descent ===
W: 0.024319, MSE: 23.419886

It worked! We slightly reduced the error. Let’s try iterating it 1000 times, instead of just 1:

=== Before Gradient Descent ===
W: 0.000329, MSE: 23.992111
===  After Gradient Descent ===
W: 1.999488, MSE: 0.000002

Results

Et voilà! A very small error, and w = 1.999488, which is almost 2! Here, below, is a more detailed output with the multiplication results:

=== Before Gradient Descent ===
W: 0.000329, MSE: 23.992111
0.000000 * 0.000329 = 0.000000 (0.000000)
1.000000 * 0.000329 = 0.000329 (2.000000)
2.000000 * 0.000329 = 0.000657 (4.000000)
3.000000 * 0.000329 = 0.000986 (6.000000)
4.000000 * 0.000329 = 0.001315 (8.000000)

===  After Gradient Descent ===
W: 1.999488, MSE: 0.000002
0.000000 * 1.999488 = 0.000000 (0.000000)
1.000000 * 1.999488 = 1.999488 (2.000000)
2.000000 * 1.999488 = 3.998976 (4.000000)
3.000000 * 1.999488 = 5.998464 (6.000000)
4.000000 * 1.999488 = 7.997952 (8.000000)
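For reference, here is a sketch of the kind of loop that could produce the table above, printing each input, the model’s prediction, and the expected output in parentheses. The print_predictions name is my own; I hard-code the trained w = 1.999488 from the run above just for demonstration:

```c
#include <stdio.h>

typedef float train[2];
train training_set[] = {{0, 0}, {1, 2}, {2, 4}, {3, 6}, {4, 8}};
#define TRAIN_SIZE (sizeof(training_set) / sizeof(train))

// Print "x * w = prediction (expected)" for every training element.
void print_predictions(float w)
{
	for (size_t i = 0; i < TRAIN_SIZE; i++)
	{
		float x = training_set[i][0];
		float y = training_set[i][1];
		printf("%f * %f = %f (%f)\n", x, w, x * w, y);
	}
}

int main(void)
{
	print_predictions(1.999488f);
	return 0;
}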

I’ve also plotted the MSE values as the training progressed.

In the next episode, we will move closer to GPT-4o by adding another parameter and trying to model some logic gates.
