A First Neural Network
In this exercise, we will train our first little neural net. Neural nets will give us a way to learn nonlinear models without the use of explicit feature crosses.
Task 1: The model as given combines our two input features into a single neuron. Will this model learn any nonlinearities? Run it to confirm your guess.
Task 2: Try increasing the number of neurons in the hidden layer from 1 to 2, and also try changing from a Linear activation to a nonlinear activation like ReLU. Can you create a model that can learn nonlinearities? Can it model the data effectively?
Task 3: Try increasing the number of neurons in the hidden layer from 2 to 3, using a nonlinear activation like ReLU. Can it model the data effectively? How does model quality vary from run to run?
Task 4: Continue experimenting by adding or removing hidden layers and neurons per layer. Also feel free to change learning rates, regularization, and other learning settings. What is the smallest number of neurons and layers you can use that gives test loss of 0.177 or lower?
Does increasing the model size improve the fit, or how quickly it converges? Does this change how often it converges to a good model? For example, try the following architecture:
- First hidden layer with 3 neurons.
- Second hidden layer with 3 neurons.
- Third hidden layer with 2 neurons.
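When comparing architectures for Task 4, it can help to count how many neurons and trainable parameters each one has. The sketch below does this for the 3-3-2 architecture listed above. It assumes 2 input features, a single output neuron, and one bias per neuron; the variable names are illustrative only.

```python
# Layer widths: 2 inputs, hidden layers of 3, 3, and 2 neurons, 1 output.
# The input and output widths are assumptions for this sketch.
layer_sizes = [2, 3, 3, 2, 1]

# Each neuron has one weight per incoming value plus one bias (assumed).
params_per_layer = [
    (n_in + 1) * n_out
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]

hidden_neurons = sum(layer_sizes[1:-1])
total_params = sum(params_per_layer)

print("hidden neurons:", hidden_neurons)
print("parameters per layer:", params_per_layer)
print("total parameters:", total_params)
```

Changing `layer_sizes` gives a quick way to compare a candidate architecture against a single hidden layer of the same total width.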
(Answers appear just below the exercise.)
Answer to Task 1:
The Activation is set to Linear, so this model cannot learn any nonlinearities. The loss is very high, and we say the model underfits the data.
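The reason a linear activation cannot produce nonlinearities is that a composition of linear layers is itself a single linear map. The following minimal sketch checks this numerically with NumPy. The layer sizes and random weights are arbitrary choices for illustration, not values taken from the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary weights for a 2 -> 3 -> 1 network with linear activations.
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)

def two_layer_linear(x):
    # Hidden layer followed by output layer, no nonlinearity in between.
    return W2 @ (W1 @ x + b1) + b2

# The same function written as one linear layer.
W = W2 @ W1
b = W2 @ b1 + b2

def single_linear(x):
    return W @ x + b

x = rng.normal(size=2)
collapsed_matches = np.allclose(two_layer_linear(x), single_linear(x))
print(collapsed_matches)
```

However many linear layers are stacked, the same collapse applies, so the decision boundary stays a straight line.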
Answer to Task 2:
With a nonlinear activation function, the model can learn nonlinearities. However, a single hidden layer with 2 neurons cannot reflect all the nonlinearities in this data set, and will have high loss even without noise: it still underfits the data. These exercises are nondeterministic, so some runs will not learn an effective model, while other runs will do a pretty good job. The best model may not have the shape you expect!
Answer to Task 3:
Playground's nondeterministic nature shines through on this exercise. A single hidden layer with 3 neurons is enough to model the data set (absent noise), but not all runs will converge to a good model.
3 neurons are enough because the XOR function can be expressed as a combination of 3 half-planes (ReLU activation). You can see this from looking at the neuron images, which show the output of the individual neurons. In a good model with 3 neurons and ReLU activation, there will be 1 image with an almost vertical line, detecting X1 being positive (or negative; the sign may be switched), 1 image with an almost horizontal line, detecting the sign of X2, and 1 image with a diagonal line, detecting their interaction.
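One hand-built set of weights that matches this description (one neuron on X1, one on X2, one on the diagonal X1 + X2) can be sketched as follows. The weights, the output bias of 0.15, and the margin of 0.3 that keeps points away from the axes are assumptions chosen for this illustration; a trained model will generally land on different values.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def xor_net(x1, x2, bias=0.15):
    # Three hidden ReLU neurons: vertical line, horizontal line, diagonal.
    h1 = relu(x1)        # fires when x1 > 0
    h2 = relu(x2)        # fires when x2 > 0
    h3 = relu(x1 + x2)   # fires above the diagonal x1 + x2 = 0
    # Output: equals bias when x1 and x2 share a sign,
    # and bias - min(|x1|, |x2|) when their signs differ.
    return -h1 - h2 + h3 + bias

# Sample XOR-style data, pushed away from the axes by an assumed margin.
rng = np.random.default_rng(0)
margin = 0.3
pts = rng.uniform(-5.0, 5.0, size=(1000, 2))
pts += np.where(pts > 0, margin, -margin)

labels = np.where(pts[:, 0] * pts[:, 1] > 0, 1, -1)
preds = np.where(xor_net(pts[:, 0], pts[:, 1]) > 0, 1, -1)

accuracy = np.mean(preds == labels)
print(accuracy)
```

The construction depends on the margin: points closer to an axis than the output bias would be misclassified, which is one reason the fit is described as sufficient rather than exact.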
However, not all runs will converge to a good model. Some runs will do no better than a model with 2 neurons, and you can see duplicate neurons in these cases.
Answer to Task 4:
A single hidden layer with 3 neurons can model the data, but there is no redundancy, so on many runs it will effectively lose a neuron and not learn a good model. A single layer with more than 3 neurons has more redundancy, and thus is more likely to converge to a good model.
As we saw, a single hidden layer with only 2 neurons cannot model the data well. If you try it, you can see that everything in the output layer can only be a shape composed of the lines from those two nodes. In this case, a deeper network can model the data set better than the first hidden layer alone: individual neurons in the second layer can model more complex shapes, like the upper-right quadrant, by combining neurons in the first layer. Although adding a second hidden layer helps, it might make more sense to add more nodes to the first layer, so that more lines are part of the kit from which the second layer builds its shapes.
However, a model with 1 neuron in the first hidden layer cannot learn a good model no matter how deep it is. This is because the output of the first layer only varies along one dimension (usually a diagonal line), which isn't enough to model this data set well. Later layers can't compensate for this, no matter how complex; information in the input data has been irrecoverably lost.
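The information loss can be made concrete with two points that belong to different classes but produce the same first-layer output. In the sketch below, the weights `w = [1, 1]` and zero bias are hypothetical values standing in for a learned diagonal line; the XOR-style labeling by the sign of x1 * x2 is also an assumption of the sketch.

```python
import numpy as np

# Hypothetical weights for the single first-layer neuron (a diagonal line).
w = np.array([1.0, 1.0])
b = 0.0

def first_layer(x):
    # One ReLU neuron: the whole input is reduced to a single number.
    return max(0.0, float(w @ x + b))

p_pos = np.array([2.0, 1.0])    # x1 * x2 > 0
p_neg = np.array([4.0, -1.0])   # x1 * x2 < 0

h_pos = first_layer(p_pos)
h_neg = first_layer(p_neg)

# Both points collapse to the same value, so any later layers,
# however deep, receive identical input and give identical output.
print(h_pos, h_neg)
```

Because every subsequent layer is a function of that single number, no amount of depth can separate the two points again.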
What if instead of trying to have a small network, we had lots of layers with lots of neurons, for a simple problem like this? Well, as we've seen, the first layer will have the ability to try lots of different line slopes. And the second layer will have the ability to accumulate them into lots of different shapes, with lots and lots of shapes on down through the subsequent layers.
By allowing the model to consider so many different shapes through so many different hidden neurons, you've created enough space for the model to start easily overfitting on the noise in the training set, allowing these complex shapes to match the foibles of the training data rather than the generalized ground truth. In this example, larger models can have complicated boundaries to match the precise data points. In extreme cases, a large model could learn an island around an individual point of noise, which is called memorizing the data. As a result, the much larger model often performs worse on test data than the simpler model with just enough neurons to solve the problem.