# Perceptron

A perceptron is a linear threshold unit: it computes a weighted sum of its inputs and activates when that sum reaches a threshold, so its decision boundary splits the input space into halfplanes.

Figure: perceptron

Below, you can compute where the decision boundary crosses the X1 axis (set X2 = 0) using the simple linear equation

X1 * W1 = theta

X1 = theta/W1

Using the same rule (set X1 = 0), X2 = theta/W2

Anything on one side of that boundary activates the unit and anything on the other side doesn't; that halfplane split is the core idea of the perceptron.

Figure: perceptron halfplane
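Here's a minimal sketch of that computation in Python; the weights and threshold (w1 = w2 = 0.5, theta = 0.5) are made-up example values, not anything fixed by the notes above.

```python
# A minimal sketch of a 2-input perceptron. The weights and threshold here
# (w1 = w2 = 0.5, theta = 0.5) are made-up example values.
def perceptron(x1, x2, w1=0.5, w2=0.5, theta=0.5):
    """Activate (return 1) when the weighted sum reaches the threshold."""
    return 1 if x1 * w1 + x2 * w2 >= theta else 0

# Axis intercepts of the decision boundary X1*W1 + X2*W2 = theta:
w1, w2, theta = 0.5, 0.5, 0.5
x1_bound = theta / w1          # where the boundary crosses the X1 axis (X2 = 0)
x2_bound = theta / w2          # where the boundary crosses the X2 axis (X1 = 0)
print(x1_bound, x2_bound)      # 1.0 1.0
print(perceptron(1, 1), perceptron(0, 0))   # 1 (activated side), 0 (other side)
```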

But setting the weights by hand isn't realistic for a whole network, so we need rules that learn the weights from data.

Perceptron rule: converges in finite time when the cases are linearly separable.

# Perceptron Training

Figure: perceptron training

The perceptron training rule updates the weights as you iterate over the training examples (see the sketch after this list):

w_i = w_i + delta_w_i, where delta_w_i = eta * (y - y_hat) * x_i

  • w_i is the weight on the ith input of the perceptron
  • eta, the learning rate, adjusts how big the weight adjustments are going to be, i.e. how fast to learn. The bigger the learning rate, the bigger the weight deltas; too big a step can overshoot and fail to converge to the optimal point.
  • (y - y_hat) is the difference between the target value the perceptron should've predicted and what it actually predicted
  • x_i is the value on the ith input
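Here's a hedged sketch of that update loop; the function name, learning rate, epoch count, and the OR dataset are illustrative assumptions, not from the notes.

```python
import numpy as np

def perceptron_rule(X, y, eta=0.1, epochs=20, seed=0):
    """Sketch of the perceptron training rule: w_i <- w_i + eta*(y - y_hat)*x_i.
    X is augmented with a constant 1 so the threshold becomes an extra weight."""
    rng = np.random.default_rng(seed)
    X = np.hstack([X, np.ones((len(X), 1))])      # fold the threshold in as a bias weight
    w = rng.normal(0, 0.1, X.shape[1])            # small random initial weights
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if x_i @ w >= 0 else 0      # thresholded activation
            w += eta * (y_i - y_hat) * x_i        # update only when the prediction is wrong
    return w

# Learn OR, which is linearly separable, so the rule converges in finite time.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])
w = perceptron_rule(X, y)
print([(1 if np.r_[x, 1] @ w >= 0 else 0) for x in X])   # [0, 1, 1, 1]
```

Because OR is linearly separable, the rule settles on a separating set of weights within a few epochs.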

# Gradient Descent

Perceptron training is great, but we need a training approach that is more robust when the cases aren't linearly separable. That's where gradient descent comes into play.

Figure: comparison of learning rates

Gradient descent needs a differentiable activation, but the hard threshold isn't differentiable. Using a sigmoid, sigma(a) = 1 / (1 + e^(-a)), as a smooth approximation of the threshold addresses this and makes gradient descent applicable.
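As a sketch of what that buys us, here's batch gradient descent on squared error for a single sigmoid unit; the names and hyperparameters are assumptions, not from the notes.

```python
import numpy as np

def sigmoid(a):
    """Smooth, differentiable stand-in for the hard threshold."""
    return 1.0 / (1.0 + np.exp(-a))

def gradient_descent(X, y, eta=0.5, epochs=5000, seed=0):
    """Batch gradient descent on squared error for a single sigmoid unit."""
    rng = np.random.default_rng(seed)
    X = np.hstack([X, np.ones((len(X), 1))])   # fold the threshold in as a bias weight
    w = rng.normal(0, 0.1, X.shape[1])         # small random starting weights
    for _ in range(epochs):
        a = sigmoid(X @ w)                     # differentiable output for every example
        grad = -(y - a) * a * (1 - a) @ X      # gradient of 1/2 * sum((y - a)^2)
        w -= eta * grad                        # step downhill on the error surface
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)        # OR, as before
w = gradient_descent(X, y)
print(np.round(sigmoid(np.hstack([X, np.ones((4, 1))]) @ w)))  # should print [0. 1. 1. 1.]
```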

Gradient descent can sometimes get stuck in local optima; some ways to counteract that are:

  • momentum (keep moving in the current direction to push through a local optimum; see the sketch after this list)
  • higher order derivatives
  • randomized optimization
  • penalty for "complexity", analogous to:
    • overfitting in regression (too many terms)
    • overfitting with overly large decision trees
    • in neural networks, more nodes, more layers, and larger weight magnitudes all add complexity and can cause overfitting
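A minimal sketch of the momentum idea from the list above; the eta and beta values and the toy 1-D objective are illustrative assumptions.

```python
def momentum_step(w, grad, velocity, eta=0.1, beta=0.9):
    """One gradient-descent-with-momentum update. The velocity term keeps the
    update moving in its previous direction, which can carry it through
    shallow local optima and flat regions."""
    velocity = beta * velocity - eta * grad
    return w + velocity, velocity

# Toy 1-D example: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, velocity = 0.0, 0.0
for _ in range(300):
    w, velocity = momentum_step(w, 2 * (w - 3), velocity)
print(w)   # approaches 3.0
```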

# Back propagation

Input flows forward from the first layer to the output layer, and the error flows backward from the output layer to adjust the weights at each layer along the way.
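A hedged sketch of that forward/backward flow for a tiny 2-4-1 sigmoid network learning XOR (which a single perceptron can't represent); the architecture, learning rate, and epoch count are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_xor(eta=1.0, epochs=20000, seed=0):
    """Sketch of backpropagation for a tiny 2-4-1 sigmoid network learning XOR."""
    rng = np.random.default_rng(seed)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)
    W1 = rng.normal(0, 1.0, (2, 4)); b1 = np.zeros(4)   # input -> hidden weights
    W2 = rng.normal(0, 1.0, (4, 1)); b2 = np.zeros(1)   # hidden -> output weights
    for _ in range(epochs):
        # Forward pass: input flows from the first layer to the output layer.
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        # Backward pass: the output error flows back toward the input,
        # and each layer's weights move against their share of the gradient.
        d_out = (out - y) * out * (1 - out)       # dE/d(net) at the output layer
        d_h = (d_out @ W2.T) * h * (1 - h)        # error propagated to the hidden layer
        W2 -= eta * h.T @ d_out;  b2 -= eta * d_out.sum(axis=0)
        W1 -= eta * X.T @ d_h;    b1 -= eta * d_h.sum(axis=0)
    return np.round(out).ravel()

print(train_xor())   # should print [0. 1. 1. 0.] once training has converged
```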

# Restriction bias

Restriction bias is the representational power of the structure we're using (in this case, networks of neurons): the set of hypotheses we're willing to consider.

For example, using a single perceptron limits us to halfplanes (hyperplanes in higher dimensions), while networks of sigmoid units can represent much more complex functions, so they impose very little restriction.

# Preference bias

Preference bias is the algorithm's selection of one representation (hypothesis) over another.

For example, decision tree learning prefers shorter trees and trees where the nodes at the top have high information gain.

In neural networks, the initial weights are set to small random values because:

  • random values help avoid getting stuck at one particular local minimum
  • random values also give variability between runs, so the model doesn't always get stuck at the same place when it's run again
  • small values mean low complexity, a preference for simpler explanations
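A minimal sketch of that initialization; the 0.1 scale and the layer shape are illustrative assumptions.

```python
import numpy as np

# Small random initial weights, as described in the list above.
rng = np.random.default_rng()             # no fixed seed, so each run starts differently
W = rng.normal(0.0, 0.1, size=(3, 4))     # small magnitudes keep the network low-complexity
b = np.zeros(4)                           # biases can simply start at zero
```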