# Perceptron
A perceptron is a linear threshold unit: it computes a weighted sum of its inputs and fires only if that sum reaches a threshold, so its decision boundary splits the input space into two halfplanes.
For two inputs, the decision boundary is the line where the weighted sum equals the threshold:
X1 * W1 + X2 * W2 = theta
Below, you can compute the X1 axis bound by setting X2 = 0 in that simple linear function:
X1 * W1 = theta
X1 = theta / W1
Using the same rule, the boundary crosses the X2 axis at X2 = theta / W2.
Anything on one side of that halfplane boundary is activated (output 1) and anything on the other side isn't (output 0); that is the core concept of the perceptron.
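As a concrete illustration of the thresholding idea, here is a minimal sketch of a single perceptron unit in Python; the weights and theta are made-up values for illustration, not from the notes:

```python
import numpy as np

def perceptron(x, w, theta):
    """Fire (output 1) only if the weighted sum reaches the threshold theta."""
    return 1 if np.dot(w, x) >= theta else 0

# Made-up weights: W1 = 1, W2 = 1, theta = 1.5 implements logical AND,
# with the boundary crossing the axes at X1 = theta/W1 = 1.5 and X2 = theta/W2 = 1.5.
w = np.array([1.0, 1.0])
theta = 1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array(x, dtype=float), w, theta))
```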
But setting the weights manually isn't realistic for a whole network, so we need learning rules that set the weights from data.
Perceptron rule: converges in finite time, provided the cases are linearly separable.
# Perceptron Training
You're essentially updating the weights as you iterate.
w_i = w_i + delta_w_i
delta_w_i = eta * (y - y_hat) * x_i
- w_i is the weight on the ith input of the perceptron
- eta is the learning rate: it adjusts how big the weight updates are, i.e. how fast to learn. The bigger the learning rate, the bigger the weight deltas; too large a learning rate can overshoot and fail to converge on the optimal point.
- (y - y_hat) is the difference between the target value the perceptron should've predicted and what it actually predicted
- x_i is the input value
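A minimal sketch of the perceptron rule in Python; the dataset, learning rate, and epoch count are invented for illustration:

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=20):
    """Perceptron rule: w_i <- w_i + eta * (y - y_hat) * x_i."""
    X = np.hstack([X, np.ones((len(X), 1))])     # fold the threshold in as a learned bias weight
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, target in zip(X, y):
            y_hat = 1 if np.dot(w, x_i) >= 0 else 0
            w += eta * (target - y_hat) * x_i    # no change when the prediction is already correct
    return w

# Linearly separable example (logical OR), so the rule converges in finite time
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1])
print(train_perceptron(X, y))
```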
# Gradient Descent
Perceptron training is great, but we need a training approach that is more robust to non-linear separability (when the cases aren't linearly separable). That's where gradient descent comes into play.
The hard threshold isn't differentiable, which is a problem for gradient descent; using a sigmoid (a smooth, S-shaped squashing function) as the threshold addresses that, giving a differentiable unit that still behaves like a threshold.
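A small sketch of the sigmoid and its derivative, which is what gradient descent needs (function names are my own):

```python
import numpy as np

def sigmoid(a):
    """Smooth, differentiable replacement for the hard threshold."""
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_derivative(a):
    """d sigmoid / da = sigmoid(a) * (1 - sigmoid(a)); this is what gradient descent uses."""
    s = sigmoid(a)
    return s * (1.0 - s)

print(sigmoid(0.0), sigmoid_derivative(0.0))   # 0.5 0.25
```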
Gradient descent can sometimes get stuck in local optima; some ways to counteract that (and otherwise go beyond plain gradient descent) are:
- momentum (keep moving in the current direction so the search can roll out of a shallow local optimum); see the sketch after this list
- higher-order derivatives (use curvature information, not just the gradient)
- randomized optimization (e.g. random restarts)
- a penalty for "complexity": just as high-degree polynomials cause overfitting in regression and large trees cause overfitting in decision trees, more nodes, more layers, and larger weight values can cause overfitting in a neural network
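A minimal sketch of gradient descent with momentum on a toy one-dimensional loss; the function, learning rate, and momentum coefficient are arbitrary choices for illustration:

```python
def grad(w):
    """Gradient of the toy loss f(w) = (w - 3)**2."""
    return 2 * (w - 3)

w, velocity = 0.0, 0.0
eta, mu = 0.1, 0.9            # learning rate and momentum coefficient (arbitrary choices)
for _ in range(200):
    velocity = mu * velocity - eta * grad(w)   # keep part of the previous step's direction
    w += velocity
print(w)                      # close to the minimum at w = 3
```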
# Back propagation
Inputs flow forward from the first layer to the output layer, and the error flows backward from the output layer to adjust the weights.
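A minimal sketch of the forward and backward passes for a tiny one-hidden-layer network of sigmoid units; the architecture, XOR data, learning rate, and iteration count are illustrative choices, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def add_bias(a):
    return np.hstack([a, np.ones((len(a), 1))])

# XOR is not linearly separable, so a hidden layer is needed
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(scale=0.5, size=(3, 4))   # (2 inputs + bias) -> 4 hidden units
W2 = rng.normal(scale=0.5, size=(5, 1))   # (4 hidden + bias) -> 1 output
eta = 0.5

for _ in range(10000):
    # forward pass: input flows from the first layer to the output layer
    h = sigmoid(add_bias(X) @ W1)
    out = sigmoid(add_bias(h) @ W2)
    # backward pass: error flows backward from the output to adjust the weights
    delta_out = (out - y) * out * (1 - out)
    delta_hidden = (delta_out @ W2[:-1].T) * h * (1 - h)
    W2 -= eta * add_bias(h).T @ delta_out
    W1 -= eta * add_bias(X).T @ delta_hidden

# typically approaches [0, 1, 1, 0]; like plain gradient descent,
# backprop can occasionally get stuck in a local optimum
print(np.round(out.ravel(), 2))
```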
# Restriction bias
The representational power of the structure we're using (in this case, networks of neurons): the set of hypotheses we're willing to consider.
For example, using a single perceptron limits us to hyperplane (linear) decision boundaries. Networks of sigmoid units are much more expressive, so they impose very little restriction.
# Preference bias
The algorithm's preference for selecting one representation over another when several fit the data.
For example, decision trees prefer trees whose top nodes have high information gain, and prefer shorter trees.
In neural networks, the initial weights are set to small random values since they:
- help avoid bad local minima
- add variability when the model is run again, so it doesn't always get stuck at the same place
- keep complexity low, preferring simpler explanations (small weights mean a simpler function)
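A one-line sketch of what "small random values" typically means in practice; the 0.01 scale and layer shape here are illustrative choices, not from the notes:

```python
import numpy as np

rng = np.random.default_rng()                     # no fixed seed, so each run starts differently
W = rng.normal(loc=0.0, scale=0.01, size=(3, 4))  # small random weights for a 3-input, 4-unit layer
print(W)
```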