We will look into how perceptrons are structured and what they actually do. Perceptrons are the first generation of neural networks; research on them has been conducted since the 1960s.
[caption id="attachment_1312" align="aligncenter" width="277"] Feed-forward Neural Network[/caption]
In 2011, Ilya Sutskever designed a recurrent neural network that generated text one character at a time.
[caption id="attachment_1317" align="aligncenter" width="292"] Recurrent Neural Network[/caption]
Statistical Pattern Recognition Approach:
[caption id="attachment_1346" align="aligncenter" width="300"] Standard Perceptron Architecture[/caption]
\begin{equation} z = b + \sum_{i}{x_i w_i} \end{equation}
\begin{equation} y = \begin{cases} 1, & \text{if } z \geq \Theta \\ 0, & \text{otherwise} \end{cases} \end{equation}
z - the total input to the neuron
y - the output of the neuron
[caption id="attachment_1166" align="aligncenter" width="322"] Binary Threshold Neuron[/caption]
Biases can be treated like weights and trained accordingly: we add an extra input that is always 1, and the weight on this "bias input" plays the role of the bias and can be learned like any other weight, as sketched after the figure below.
[caption id="attachment_1366" align="aligncenter" width="300"] Binary Threshold Neuron - Training Biases[/caption]
The training algorithm (the perceptron convergence procedure) is simple: if the output is correct, leave the weights alone; if the perceptron incorrectly outputs 0, add the bias-augmented input vector to the weights; if it incorrectly outputs 1, subtract the input vector from the weights. Using this procedure, it is guaranteed that a set of weights giving the right answer for every training case will be found, if any such set exists.
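The following is a minimal sketch of that procedure, reusing the bias-as-extra-input trick from above; the training-set layout and loop structure are illustrative assumptions rather than a definitive implementation.

```python
def train_perceptron(training_cases, n_epochs=100):
    """Perceptron convergence procedure with the bias absorbed as an extra weight.

    training_cases: list of (inputs, target) pairs, with target in {0, 1}.
    """
    n_weights = len(training_cases[0][0]) + 1      # +1 for the constant bias input
    weights = [0.0] * n_weights
    for _ in range(n_epochs):
        for inputs, target in training_cases:
            x = list(inputs) + [1]                 # bias-augmented input vector
            z = sum(xi * wi for xi, wi in zip(x, weights))
            output = 1 if z >= 0 else 0            # threshold absorbed into the bias weight
            if output == target:
                continue                           # correct: leave the weights alone
            elif target == 1:                      # incorrectly output 0: add the input vector
                weights = [wi + xi for wi, xi in zip(weights, x)]
            else:                                  # incorrectly output 1: subtract the input vector
                weights = [wi - xi for wi, xi in zip(weights, x)]
    return weights

# Example: logical AND is linearly separable, so the procedure finds correct weights.
and_cases = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
print(train_perceptron(and_cases))
```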
If we imagine weight space as an n-dimensional space in which each dimension corresponds to one weight, then a point in this space represents a particular setting of all the weights. Each training case, in this picture, corresponds to a plane (a hyperplane) in weight space.
Each training case can be thought of as a constraint: its hyperplane partitions weight space into two halves, and only the weight vectors on the correct side of it give the correct classification for that case.
There are mathematical proofs (the perceptron convergence theorem) that perceptron learning will end up in the feasible region of weight space whenever such a region exists.
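To make the constraint picture explicit (a brief formalization, assuming the threshold has been absorbed into the bias weight so that the decision rule is simply z >= 0): for a training case with bias-augmented input vector x and target t, the feasible weight vectors w are exactly those satisfying

\begin{equation} \mathbf{w} \cdot \mathbf{x} \geq 0 \quad \text{if } t = 1, \qquad \mathbf{w} \cdot \mathbf{x} < 0 \quad \text{if } t = 0 \end{equation}

Each such constraint is a half-space bounded by the hyperplane through the origin whose normal vector is x; the feasible region discussed above is the intersection of all these half-spaces.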
However, binary threshold neurons sometimes cannot handle even very simple cases:
Imagine 2D binary inputs where we want the neuron to detect whether the two input components are equal. In these terms, the points (0,0) and (1,1) should lead to the answer 1, while (0,1) and (1,0) should lead to the answer 0. In this example, there is no single set of weights and threshold that satisfies all the constraints.
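A short derivation shows why, using the total input z = b + x1 w1 + x2 w2 and the threshold from the formulas above. The four training cases require:

\begin{equation} b \geq \Theta, \qquad w_1 + w_2 + b \geq \Theta, \qquad w_2 + b < \Theta, \qquad w_1 + b < \Theta \end{equation}

Adding the first two inequalities gives w1 + w2 + 2b >= 2Θ, while adding the last two gives w1 + w2 + 2b < 2Θ, which is a contradiction; so no choice of weights and threshold can satisfy all four cases.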
In other words, this set of training cases is not linearly separable, so no weight plane exists that can properly classify all of the inputs.
Another case perceptrons cannot resolve is discriminating between simple patterns that are translated with wrap-around.
Based on this, Minsky and Papert's Group Invariance Theorem (from their 1969 book "Perceptrons") states that the part of a perceptron that learns cannot learn to do this if the transformations form a group, and translations with wrap-around do form a group.
Networks without hidden units are very limited in what they can learn to model. What is required is multiple layers of adaptive, non-linear hidden units. We also need a way to adapt all the weights, not just those in the last layer.
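As an illustration of why hidden units help, here is a sketch of a two-layer network of binary threshold units that computes the "both inputs equal" function from the example above. The weights below are hand-picked for the illustration rather than learned, which is exactly the remaining problem: we still need a procedure that can adapt the hidden-layer weights.

```python
def threshold(z):
    """Binary threshold activation: 1 if the total input is non-negative, else 0."""
    return 1 if z >= 0 else 0

def equal_inputs_net(x1, x2):
    """Two-layer binary threshold network computing 'x1 equals x2'.

    h1 detects 'both inputs on', h2 detects 'both inputs off'; the output
    unit fires if either hidden unit fires. Weights are hand-picked, not learned.
    """
    h1 = threshold(1.0 * x1 + 1.0 * x2 - 1.5)    # AND of the inputs
    h2 = threshold(-1.0 * x1 - 1.0 * x2 + 0.5)   # NOR of the inputs
    return threshold(1.0 * h1 + 1.0 * h2 - 0.5)  # OR of the hidden units

# (0,0) and (1,1) give 1; (0,1) and (1,0) give 0 -- the cases a single perceptron cannot separate.
for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, equal_inputs_net(a, b))
```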