# Understanding Neural Networks Through Mathematics

Neural networks are fundamentally mathematical constructs. In this post, we'll explore the key mathematical concepts that make neural networks work.
## The Basics: Linear Transformation

At its core, a neural network layer performs a linear transformation followed by a non-linear activation. For a single neuron:
$$z = \mathbf{w}^T \mathbf{x} + b$$

Where:
- $\mathbf{w}$ is the weight vector
- $\mathbf{x}$ is the input vector
- $b$ is the bias term
- $z$ is the pre-activation output
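As a concrete illustration, here is a minimal NumPy sketch of the pre-activation computation for a single neuron. The vector sizes and values are arbitrary choices for the example, not anything prescribed above.

```python
import numpy as np

# Example input, weights, and bias (arbitrary values for illustration)
x = np.array([0.5, -1.2, 3.0])   # input vector
w = np.array([0.1, 0.4, -0.2])   # weight vector
b = 0.5                          # bias term

# Pre-activation output: z = w^T x + b
z = np.dot(w, x) + b
print(z)  # approximately -0.53
```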
## Activation Functions

### Sigmoid Function

The sigmoid function maps any real number to a value between 0 and 1:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Its derivative is particularly elegant:
$$\frac{d\sigma}{dz} = \sigma(z)(1 - \sigma(z))$$
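A small sketch of the sigmoid and its derivative in NumPy (the function names here are my own, chosen for the example):

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """d(sigma)/dz = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0.0))             # 0.5
print(sigmoid_derivative(0.0))  # 0.25
```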
### ReLU (Rectified Linear Unit)

The ReLU function is defined as:

$$\text{ReLU}(z) = \max(0, z) = \begin{cases} z & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases}$$
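The same piecewise definition translates directly to NumPy (again, the helper name is an assumption made for the example):

```python
import numpy as np

def relu(z):
    """Elementwise max(0, z)."""
    return np.maximum(0.0, z)

print(relu(np.array([-2.0, 0.0, 3.5])))  # [0.  0.  3.5]
```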
## Forward Propagation

For a multi-layer network, forward propagation can be expressed as:

$$\mathbf{a}^{(l)} = f^{(l)}(\mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)})$$

Where:
- $\mathbf{a}^{(l)}$ is the activation of layer $l$
- $\mathbf{W}^{(l)}$ is the weight matrix for layer $l$
- $\mathbf{b}^{(l)}$ is the bias vector for layer $l$
- $f^{(l)}$ is the activation function for layer $l$
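Putting the layer equation into code, here is a minimal sketch of forward propagation through a tiny stack of layers. The layer sizes, random initialization, and the `forward` helper are illustrative assumptions, not a production setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

# Illustrative 3-2-1 network: weight matrix W, bias vector b, activation f per layer
layers = [
    {"W": rng.normal(size=(2, 3)), "b": np.zeros(2), "f": relu},
    {"W": rng.normal(size=(1, 2)), "b": np.zeros(1), "f": relu},
]

def forward(x, layers):
    """Apply a^(l) = f^(l)(W^(l) a^(l-1) + b^(l)) layer by layer."""
    a = x
    for layer in layers:
        z = layer["W"] @ a + layer["b"]
        a = layer["f"](z)
    return a

print(forward(np.array([1.0, -0.5, 2.0]), layers))
```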
## Loss Functions

### Mean Squared Error

For regression tasks, we often use MSE:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
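In NumPy, MSE over a batch of predictions reduces to a one-liner (the arrays below are made-up example data):

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.3])

# Mean of squared residuals
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # approximately 0.0367
```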
### Cross-Entropy Loss

For classification, cross-entropy is common:

$$\text{CE} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$
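A sketch of cross-entropy for a single example with a one-hot label over $C$ classes. The values are illustrative, and the small epsilon guarding against `log(0)` is my own addition, not part of the formula above.

```python
import numpy as np

y_true = np.array([0.0, 1.0, 0.0])   # one-hot label, C = 3 classes
y_pred = np.array([0.1, 0.7, 0.2])   # predicted class probabilities

eps = 1e-12                          # numerical safeguard against log(0)
ce = -np.sum(y_true * np.log(y_pred + eps))
print(ce)  # approximately 0.357, i.e. -log(0.7)
```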
## Backpropagation

The heart of neural network training is backpropagation, which uses the chain rule:

$$\frac{\partial L}{\partial w_{ij}^{(l)}} = \frac{\partial L}{\partial z_j^{(l)}} \frac{\partial z_j^{(l)}}{\partial w_{ij}^{(l)}}$$

The gradient of the loss with respect to the pre-activation is:
$$\delta_j^{(l)} = \frac{\partial L}{\partial z_j^{(l)}}$$

And the weight update rule becomes:
$$w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \alpha \frac{\partial L}{\partial w_{ij}^{(l)}}$$

Where $\alpha$ is the learning rate.
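As a worked illustration, here is a minimal sketch of one backpropagation step for a single sigmoid neuron with a squared-error loss. The data, learning rate, and the 1/2 factor on the loss are assumptions made for the example, chosen so the chain rule stays short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative data: one input/target pair
x = np.array([0.5, -1.0])
y = 1.0
w = np.array([0.2, 0.3])
b = 0.0
alpha = 0.1  # learning rate

# Forward pass
z = np.dot(w, x) + b
a = sigmoid(z)
L = 0.5 * (a - y) ** 2   # squared error with a 1/2 factor for a cleaner derivative

# Backward pass via the chain rule:
# delta = dL/dz = dL/da * da/dz = (a - y) * sigma'(z)
delta = (a - y) * a * (1.0 - a)
# dz/dw_i = x_i, so dL/dw = delta * x
grad_w = delta * x
grad_b = delta

# Weight update: w <- w - alpha * dL/dw
w = w - alpha * grad_w
b = b - alpha * grad_b
print(w, b)
```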
## Gradient Descent Optimization

The basic gradient descent update rule:
$$\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)$$
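To see the update rule in isolation, here is a small sketch that minimizes a toy quadratic loss $L(\theta) = (\theta - 3)^2$ by plain gradient descent. The loss, starting point, and step count are arbitrary choices for illustration.

```python
# Minimize L(theta) = (theta - 3)^2; its gradient is 2 * (theta - 3)
theta = 0.0
alpha = 0.1  # learning rate

for t in range(50):
    grad = 2.0 * (theta - 3.0)
    theta = theta - alpha * grad

print(theta)  # converges toward 3.0
```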
### Momentum

Adding momentum helps with convergence:

$$\begin{aligned} v_{t+1} &= \beta v_t + \alpha \nabla_\theta L(\theta_t) \\ \theta_{t+1} &= \theta_t - v_{t+1} \end{aligned}$$

### Adam Optimizer

The Adam optimizer combines momentum with adaptive learning rates:

$$\begin{aligned} m_t &= \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta L(\theta_t) \\ v_t &= \beta_2 v_{t-1} + (1 - \beta_2) (\nabla_\theta L(\theta_t))^2 \\ \hat{m}_t &= \frac{m_t}{1 - \beta_1^t} \\ \hat{v}_t &= \frac{v_t}{1 - \beta_2^t} \\ \theta_{t+1} &= \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t \end{aligned}$$
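Both update rules can be sketched on the same toy quadratic as above. The hyperparameter values follow common defaults, but the loss and iteration counts are illustrative assumptions, not recommendations.

```python
import numpy as np

def grad(theta):
    # Gradient of the toy loss L(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

# Gradient descent with momentum
theta, v = 0.0, 0.0
alpha, beta = 0.1, 0.9
for t in range(200):
    v = beta * v + alpha * grad(theta)
    theta = theta - v
print(theta)  # approaches 3.0

# Adam: first and second moment estimates with bias correction
theta, m, v = 0.0, 0.0, 0.0
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 201):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
print(theta)  # approaches 3.0
```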
## Conclusion

Understanding the mathematics behind neural networks gives us deeper insight into how they work and how to improve them. These equations form the foundation for more advanced architectures like transformers, CNNs, and RNNs.