Dense Layer¶
How do we actually initialize a layer for a new neural network?
Initialize the weights with small random values.
Why? As Andrew Ng explains, if all the weights/parameters are initialized to zero (or to the same value), all the hidden units become symmetric: identical nodes.
With identical nodes there is no learning or decision making, because every node computes the same value.
If all the weights are zero, every multiplication with the weights is also zero, and the propagation result is not a conclusive one (a dead network).
Initialization of the bias can be zero,
since randomness is already introduced by the weights. For a smaller neural network, though, it is sometimes advised not to initialize the bias with zero.
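A minimal sketch of these two points in NumPy (the layer sizes and the input below are made-up placeholders, not values used later in this notebook):

import numpy as np

n_inputs, n_neurons = 3, 5                       # illustrative sizes only
x = np.random.random((10, n_inputs))             # 10 made-up samples

w_zero = np.zeros((n_inputs, n_neurons))
print(x @ w_zero)                                # all zeros: identical, "dead" nodes

w = 0.10 * np.random.randn(n_inputs, n_neurons)  # small random values break the symmetry
b = np.zeros((1, n_neurons))                     # zero bias is fine once the weights are random
print(x @ w + b)                                 # the units now differ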
\begin{align*}
X &= \begin{bmatrix}
        x_1^{(1)} & x_1^{(2)} & \dots & x_1^{(m)}\\
        x_2^{(1)} & x_2^{(2)} & \dots & x_2^{(m)}\\
        & & \vdots \\
        x_n^{(1)} & x_n^{(2)} & \dots & x_n^{(m)}\\
    \end{bmatrix}_{n \times m}\\
W &= \begin{bmatrix}
        w_1^{(1)} & w_1^{(2)} & \dots & w_1^{(m)}\\
        w_2^{(1)} & w_2^{(2)} & \dots & w_2^{(m)}\\
        & & \vdots \\
        w_n^{(1)} & w_n^{(2)} & \dots & w_n^{(m)}\\
    \end{bmatrix}_{n \times m}\\
b &= \begin{bmatrix} b_1 & b_2 & \dots & b_n \end{bmatrix}_{1 \times n}\\
Z &= X W^T + b\\
\\
&= \begin{bmatrix}
        x_1^{(1)} & x_1^{(2)} & \dots & x_1^{(m)}\\
        x_2^{(1)} & x_2^{(2)} & \dots & x_2^{(m)}\\
        & & \vdots \\
        x_n^{(1)} & x_n^{(2)} & \dots & x_n^{(m)}\\
    \end{bmatrix}_{n \times m}
    \begin{bmatrix}
        w_1^{(1)} & w_2^{(1)} & \dots & w_n^{(1)}\\
        w_1^{(2)} & w_2^{(2)} & \dots & w_n^{(2)}\\
        & & \vdots \\
        w_1^{(m)} & w_2^{(m)} & \dots & w_n^{(m)}\\
    \end{bmatrix}_{m \times n}
    + \begin{bmatrix} b_1 & b_2 & \dots & b_n \end{bmatrix}_{1 \times n}\\
\\
&= \begin{bmatrix}
        x_1^{(1)}w_1^{(1)} + x_1^{(2)}w_1^{(2)} + \dots + x_1^{(m)}w_1^{(m)} & \dots & x_1^{(1)}w_n^{(1)} + x_1^{(2)}w_n^{(2)} + \dots + x_1^{(m)}w_n^{(m)} \\
        x_2^{(1)}w_1^{(1)} + x_2^{(2)}w_1^{(2)} + \dots + x_2^{(m)}w_1^{(m)} & \dots & x_2^{(1)}w_n^{(1)} + x_2^{(2)}w_n^{(2)} + \dots + x_2^{(m)}w_n^{(m)} \\
        & \vdots \\
        x_n^{(1)}w_1^{(1)} + x_n^{(2)}w_1^{(2)} + \dots + x_n^{(m)}w_1^{(m)} & \dots & x_n^{(1)}w_n^{(1)} + x_n^{(2)}w_n^{(2)} + \dots + x_n^{(m)}w_n^{(m)}
    \end{bmatrix}_{n \times n}
    + \begin{bmatrix}
        b_1 & b_2 & \dots & b_n\\
        b_1 & b_2 & \dots & b_n\\
        & & \vdots\\
        b_1 & b_2 & \dots & b_n\\
    \end{bmatrix}_{n \times n \text{ broadcasting}}\\
\\
&= \begin{bmatrix}
        x_1^{(1)}w_1^{(1)} + x_1^{(2)}w_1^{(2)} + \dots + x_1^{(m)}w_1^{(m)} + b_1 & \dots & x_1^{(1)}w_n^{(1)} + x_1^{(2)}w_n^{(2)} + \dots + x_1^{(m)}w_n^{(m)} + b_n \\
        x_2^{(1)}w_1^{(1)} + x_2^{(2)}w_1^{(2)} + \dots + x_2^{(m)}w_1^{(m)} + b_1 & \dots & x_2^{(1)}w_n^{(1)} + x_2^{(2)}w_n^{(2)} + \dots + x_2^{(m)}w_n^{(m)} + b_n \\
        & \vdots \\
        x_n^{(1)}w_1^{(1)} + x_n^{(2)}w_1^{(2)} + \dots + x_n^{(m)}w_1^{(m)} + b_1 & \dots & x_n^{(1)}w_n^{(1)} + x_n^{(2)}w_n^{(2)} + \dots + x_n^{(m)}w_n^{(m)} + b_n
    \end{bmatrix}_{n \times n}
\end{align*}
Forward¶
\begin{align*}
Z^{[1]} &= A^{[0]} W^{[1]T} + b^{[1]}\\
A^{[1]} &= g^{[1]}(Z^{[1]})\\
\\
Z^{[2]} &= A^{[1]} W^{[2]T} + b^{[2]}\\
A^{[2]} &= g^{[2]}(Z^{[2]})
\end{align*}
Generalized
\begin{align*}
Z^{[l]} &= A^{[l-1]} W^{[l]T} + b^{[l]}\\
A^{[l]} &= g^{[l]}(Z^{[l]})
\end{align*}
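The generalized recurrence can be turned into a short loop. This is only a sketch, assuming the parameters arrive as a list of (W, b) pairs and that every \(g^{[l]}\) is a sigmoid; neither assumption comes from the notebook itself.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_pass(a0, params):
    """params: list of (W, b) pairs, W of shape (n_out, n_in), b of shape (1, n_out)."""
    a = a0
    for W, b in params:
        z = a @ W.T + b        # Z^[l] = A^[l-1] W^[l]T + b^[l]
        a = sigmoid(z)         # A^[l] = g^[l](Z^[l])
    return a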
[1]:
from abc import ABC,abstractmethod
import numpy as np
import matplotlib.pyplot as plt
Let's take two layers.
Let's take layer 1 as the input layer, so its input is x, i.e. \(a^{[0]}\).
Let's take the number of nodes \(n^{[0]} = 3\)
and the number of samples \(m = 10\).
shape of \(a^{[0]} = (n^{[0]}, m)\) → (3, 10)
shape of \(w^{[1]} = (n^{[0]}, m) = dw^{[1]}\) → (3, 10)
shape of \(b^{[1]} = (1, n^{[0]}) = db^{[1]}\) → (1, 3)
shape of \(z^{[1]} = (n^{[0]}, m)(m, n^{[0]}) + (1, n^{[0]}) = (n^{[0]}, n^{[0]}) = dz^{[1]}\) → (3, 10)(10, 3) + (1, 3) = (3, 3)
shape of \(z^{[1]}\) = shape of \(a^{[1]} = (n^{[0]}, n^{[0]})\) → (3, 3)
Let's take layer 2 as the next layer, the first hidden layer. The input to this layer is \(a^{[1]}\).
Let's take the number of nodes in this layer \(n^{[1]} = 5\).
shape of \(w^{[2]} = (n^{[1]}, n^{[0]}) = dw^{[2]}\) → (5, 3)
shape of \(b^{[2]} = (1, n^{[1]}) = db^{[2]}\) → (1, 5)
shape of \(z^{[2]} = (n^{[0]}, n^{[0]})(n^{[0]}, n^{[1]}) + (1, n^{[1]}) = (n^{[0]}, n^{[1]}) = dz^{[2]}\) → (3, 3)(3, 5) + (1, 5) = (3, 5)
[2]:
n0 = 3
n1 = 5
m = 10
[3]:
a0 = np.random.random((n0, m))   # input activations A^[0], shape (n0, m)
w1 = np.random.random((n0, m))   # weights of layer 1, shape (n0, m)
b1 = np.random.random((1, n0))   # bias of layer 1, shape (1, n0)
print(w1.shape, a0.shape,'+', b1.shape)
(3, 10) (3, 10) + (1, 3)
[4]:
z1 = (a0 @ w1.T) + b1   # Z^[1] = A^[0] W^[1]T + b^[1]
z1.shape
[4]:
(3, 3)
[5]:
a1 = 1/(1 + np.exp(-z1))   # sigmoid activation g^[1]
a1.shape
[5]:
(3, 3)
[6]:
w2 = np.random.random((n1, n0))   # weights of layer 2, shape (n1, n0)
b2 = np.random.random((1, n1))    # bias of layer 2, shape (1, n1)
print(w2.shape, a1.shape,'+', b2.shape)
(5, 3) (3, 3) + (1, 5)
[7]:
z2 = (a1 @ w2.T) + b2   # Z^[2] = A^[1] W^[2]T + b^[2]
z2.shape
[7]:
(3, 5)
[8]:
a2 = 1/(1 + np.exp(-z2))   # sigmoid activation g^[2]
a2.shape
[8]:
(3, 5)
Backward¶
\begin{align*}
& \text{gradients of this layer's parameters (the backward step starts from the incoming gradient } dZ'\text{)}\\
dW &= (dZ')^T A\\
db &= \sum (dZ')\\
\\
& \text{input for the next layer in backward propagation (the previous layer in forward order)}\\
dZ &= dZ'\, W
\end{align*}
(The derivative of the activation \(g\) is applied in a separate step; the dense layer itself only propagates through \(W\) and \(b\).)
[9]:
dz2 = np.random.random((n0, n1))   # stand-in for the gradient flowing back into layer 2, same shape as z2
dz2.shape
[9]:
(3, 5)
[10]:
dw2 = dz2.T @ a1   # dW = (dZ')^T A, same shape as w2
dw2.shape
[10]:
(5, 3)
[11]:
db2 = dz2.sum(axis=0, keepdims=True)   # sum over the sample dimension
db2.shape
[11]:
(1, 5)
[12]:
dz1 = dz2 @ w2   # dZ = dZ' W, gradient flowing back into layer 1
dz1.shape
[12]:
(3, 3)
[13]:
dw1 = dz1.T @ a0   # dW = (dZ')^T A, same shape as w1
dw1.shape
[13]:
(3, 10)
[14]:
db1 = dz1.sum(axis=0, keepdims=True)   # sum over the sample dimension
db1.shape
[14]:
(1, 3)
[15]:
dz1 @ w1   # gradient with respect to the input a0
[15]:
array([[1.00430696, 1.12665459, 1.27528356, 0.37028909, 1.83008842,
0.86290497, 1.23745471, 1.23044548, 0.83923269, 1.65279249],
[0.77372465, 0.99710549, 1.00752794, 0.2431575 , 1.27532378,
0.7486004 , 1.11498651, 0.84140139, 0.61524338, 1.5975826 ],
[1.57524514, 1.82300126, 2.04126706, 0.56541619, 2.87108659,
1.40560442, 2.03900305, 1.88715055, 1.31532754, 2.74793296]])
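With the gradients above, a plain gradient-descent update is just a scaled subtraction of each gradient from its parameter. A minimal sketch using the arrays computed above; the learning rate lr is a made-up value for illustration, not something defined earlier in the notebook:

lr = 0.01                 # hypothetical learning rate, for illustration only
w2 = w2 - lr * dw2        # dw2 has the same shape as w2: (5, 3)
b2 = b2 - lr * db2        # db2 has the same shape as b2: (1, 5)
w1 = w1 - lr * dw1        # dw1 has the same shape as w1: (3, 10)
b1 = b1 - lr * db1        # db1 has the same shape as b1: (1, 3)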
Model¶
[16]:
class LayerDense:
    """Dense layer module.

    It is recommended that the input data X is scaled (data scaling operations)
    so that the data is normalized but its meaning stays the same.

    Args:
        n_inputs (int) : number of inputs
        n_neurons (int) : number of neurons
    """
    def __init__(self, n_inputs, n_neurons):
        """Initialize weights with small random values and biases with zeros."""
        # w is stored as (n_inputs, n_neurons), i.e. already transposed,
        # so the forward pass needs no transpose
        self.w = 0.10 * np.random.randn(n_inputs, n_neurons)  # multiply by 0.1 to keep it small
        self.b = np.zeros((1, n_neurons))
    def forward(self, a):
        """Forward propagation: z = a w + b"""
        self.a = a
        self.z = np.dot(self.a, self.w) + self.b
    def backward(self, dz):
        """Backward pass"""
        # gradients on the parameters
        self.dw = self.a.T @ dz                   # same shape as w: (n_inputs, n_neurons)
        self.db = dz.sum(axis=0, keepdims=True)   # same shape as b: (1, n_neurons)
        # gradient on the input / passed to the next layer in backpropagation
        self.dz = dz @ self.w.T                   # same shape as the input a
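A quick way to exercise the class, as a sketch: the layer sizes, the random input, and the random upstream gradient below are made up for illustration, and the upstream gradient stands in for what a loss function would provide.

layer = LayerDense(n_inputs=4, n_neurons=6)
x = np.random.random((10, 4))             # 10 samples, 4 features
layer.forward(x)
print(layer.z.shape)                      # (10, 6)

upstream = np.random.random((10, 6))      # stand-in for the gradient of a loss
layer.backward(upstream)
print(layer.dw.shape, layer.db.shape, layer.dz.shape)   # (4, 6) (1, 6) (10, 4)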