Understanding Neural Networks Through Mathematics

 — #machine-learning #mathematics #neural-networks

Neural networks are fundamentally mathematical constructs. In this post, we'll explore the key mathematical concepts that make neural networks work.

The Basics: Linear Transformation

At its core, a neural network layer performs a linear transformation followed by a non-linear activation. For a single neuron:

$$z = \mathbf{w}^T \mathbf{x} + b$$

Where:

  • $\mathbf{w}$ is the weight vector
  • $\mathbf{x}$ is the input vector
  • $b$ is the bias term
  • $z$ is the pre-activation output
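
As a quick sketch, the pre-activation of a single neuron can be computed with NumPy. The weights, input, and bias below are made-up example values:

```python
import numpy as np

w = np.array([0.5, -0.3, 0.8])   # weight vector
x = np.array([1.0, 2.0, 3.0])    # input vector
b = 0.1                          # bias term

z = np.dot(w, x) + b             # pre-activation: z = w^T x + b
print(z)                         # 0.5 - 0.6 + 2.4 + 0.1 = 2.4
```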

Activation Functions

Sigmoid Function

The sigmoid function maps any real number to a value between 0 and 1:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Its derivative is particularly elegant:

$$\frac{d\sigma}{dz} = \sigma(z)(1 - \sigma(z))$$
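
A minimal sketch of both formulas; note that the derivative reuses the sigmoid value itself, which is why it is so cheap to compute during backpropagation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    # d(sigma)/dz = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# sigmoid(0) = 0.5, so the derivative at 0 is 0.5 * 0.5 = 0.25
print(sigmoid(0.0), sigmoid_deriv(0.0))
```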

ReLU (Rectified Linear Unit)

The ReLU function is defined as:

$$\text{ReLU}(z) = \max(0, z) = \begin{cases} z & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases}$$
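
In code this is a one-liner; `np.maximum` applies the clamp elementwise:

```python
import numpy as np

def relu(z):
    # max(0, z), applied elementwise
    return np.maximum(0.0, z)

print(relu(np.array([-2.0, 0.0, 3.0])))  # negatives clamp to 0
```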

Forward Propagation

For a multi-layer network, forward propagation can be expressed as:

$$\mathbf{a}^{(l)} = f^{(l)}(\mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)})$$

Where:

  • $\mathbf{a}^{(l)}$ is the activation of layer $l$
  • $\mathbf{W}^{(l)}$ is the weight matrix for layer $l$
  • $\mathbf{b}^{(l)}$ is the bias vector for layer $l$
  • $f^{(l)}$ is the activation function for layer $l$
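
The recurrence above is just a loop over layers. Here is a minimal sketch of a fully connected network with illustrative layer sizes (3 → 4 → 2), random weights, and ReLU at every layer:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

layer_sizes = [3, 4, 2]  # input -> hidden -> output
weights = [rng.standard_normal((m, n)) for n, m in zip(layer_sizes, layer_sizes[1:])]
biases = [np.zeros(m) for m in layer_sizes[1:]]

def forward(x):
    a = x
    for W, b in zip(weights, biases):
        a = relu(W @ a + b)  # a^(l) = f(W^(l) a^(l-1) + b^(l))
    return a

out = forward(np.array([1.0, 0.5, -0.5]))
print(out.shape)  # a 2-dimensional output vector
```

In practice the output layer usually gets a different activation (e.g. softmax or identity), but the loop structure is the same.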

Loss Functions

Mean Squared Error

For regression tasks, we often use MSE:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Cross-Entropy Loss

For classification, cross-entropy is common:

$$\text{CE} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$
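
Both losses translate directly to code. The targets and predictions below are small hand-picked examples; the cross-entropy sketch assumes a one-hot target and a valid probability distribution as the prediction:

```python
import numpy as np

def mse(y, y_hat):
    # mean of squared residuals
    return np.mean((y - y_hat) ** 2)

def cross_entropy(y, y_hat):
    # y is a one-hot target, y_hat a predicted probability distribution
    return -np.sum(y * np.log(y_hat))

print(mse(np.array([1.0, 2.0]), np.array([1.5, 2.5])))            # 0.25
print(cross_entropy(np.array([0.0, 1.0]), np.array([0.5, 0.5])))  # ln 2 ≈ 0.693
```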

Backpropagation

The heart of neural network training is backpropagation, which uses the chain rule:

$$\frac{\partial L}{\partial w_{ij}^{(l)}} = \frac{\partial L}{\partial z_j^{(l)}} \frac{\partial z_j^{(l)}}{\partial w_{ij}^{(l)}}$$

The gradient of the loss with respect to the pre-activation is:

$$\delta_j^{(l)} = \frac{\partial L}{\partial z_j^{(l)}}$$

And the weight update rule becomes:

$$w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \alpha \frac{\partial L}{\partial w_{ij}^{(l)}}$$

Where $\alpha$ is the learning rate.
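
To make the chain rule concrete, here is a sketch for the smallest possible case: one sigmoid neuron with squared-error loss $L = (\sigma(\mathbf{w}^T\mathbf{x} + b) - y)^2$. The weights, input, target, and learning rate are made-up example values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -0.3])
x = np.array([1.0, 2.0])
b, y, alpha = 0.1, 1.0, 0.01

# forward pass
z = np.dot(w, x) + b
a = sigmoid(z)

# delta = dL/dz = dL/da * da/dz  (chain rule)
delta = 2.0 * (a - y) * a * (1.0 - a)

# dz/dw_i = x_i, so dL/dw_i = delta * x_i
grad_w = delta * x
grad_b = delta

# gradient-descent update: w <- w - alpha * dL/dw
w = w - alpha * grad_w
b = b - alpha * grad_b
```

Here $z = 0$, so $a = 0.5$ and $\delta = 2(0.5 - 1)(0.25) = -0.25$; the negative gradient pushes the prediction up toward the target.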

Gradient Descent Optimization

The basic gradient descent update rule:

$$\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)$$
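
A minimal sketch of this update on a toy one-dimensional objective, $L(\theta) = (\theta - 3)^2$ with gradient $2(\theta - 3)$; the starting point, learning rate, and iteration count are illustrative:

```python
theta = 0.0
alpha = 0.1

for _ in range(100):
    grad = 2.0 * (theta - 3.0)     # gradient of L(theta) = (theta - 3)^2
    theta = theta - alpha * grad   # theta_{t+1} = theta_t - alpha * grad

print(theta)  # converges toward the minimum at theta = 3
```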

Momentum

Adding momentum helps with convergence:

$$\begin{aligned} v_{t+1} &= \beta v_t + \alpha \nabla_\theta L(\theta_t) \\ \theta_{t+1} &= \theta_t - v_{t+1} \end{aligned}$$
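
The same toy setup with a velocity term; $\beta = 0.9$ and $\alpha = 0.1$ are common but illustrative choices here, minimizing $L(\theta) = \theta^2$:

```python
theta, v = 5.0, 0.0
alpha, beta = 0.1, 0.9

for _ in range(200):
    grad = 2.0 * theta            # gradient of L(theta) = theta^2
    v = beta * v + alpha * grad   # velocity accumulates past gradients
    theta = theta - v

print(theta)  # approaches the minimum at 0
```

Because the velocity keeps part of each previous step, momentum overshoots and oscillates a little, but it damps out and typically reaches the minimum in fewer effective steps than plain gradient descent on ill-conditioned problems.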

Adam Optimizer

The Adam optimizer combines momentum with adaptive learning rates:

$$\begin{aligned} m_t &= \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta L(\theta_t) \\ v_t &= \beta_2 v_{t-1} + (1 - \beta_2) (\nabla_\theta L(\theta_t))^2 \\ \hat{m}_t &= \frac{m_t}{1 - \beta_1^t} \\ \hat{v}_t &= \frac{v_t}{1 - \beta_2^t} \\ \theta_{t+1} &= \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t \end{aligned}$$
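
These five updates map line-for-line into code. Below is a sketch on the same toy objective $L(\theta) = \theta^2$, using the commonly cited default hyperparameters ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$); the starting point and step count are illustrative:

```python
import math

theta = 5.0
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m, v = 0.0, 0.0

for t in range(1, 501):  # t starts at 1 for the bias correction
    grad = 2.0 * theta
    m = beta1 * m + (1 - beta1) * grad       # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)

print(theta)  # settles near the minimum at 0
```

The bias-correction terms matter early on: with $m_0 = v_0 = 0$, the raw moving averages underestimate the true moments, and dividing by $1 - \beta^t$ compensates for exactly that.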

Conclusion

Understanding the mathematics behind neural networks gives us deeper insight into how they work and how to improve them. These equations form the foundation for more advanced architectures like transformers, CNNs, and RNNs.