## A First Neural Network

In this exercise, we will train our first little neural net.
Neural nets will give us a way to learn nonlinear models without
the use of explicit feature crosses.

**Task 1:** The model as given combines our two input features into a
single neuron. Will this model learn any nonlinearities? Run it to confirm your
guess.

**Task 2:** Try increasing the number of neurons in the hidden layer from
1 to 2, and also try changing from a Linear activation to a nonlinear activation
like ReLU. Can you create a model that can learn nonlinearities? Can it model
the data effectively?

**Task 3:** Try increasing the number of neurons in the hidden layer from
2 to 3, using a nonlinear activation like ReLU. Can it model the data
effectively? How does model quality vary from run to run?

**Task 4:** Continue experimenting by adding or removing hidden layers
and neurons per layer. Also feel free to change learning rates,
regularization, and other learning settings. What is the *smallest*
number of neurons and layers you can use that gives test loss
of 0.177 or lower?

Does increasing the model size improve the fit, or how quickly it converges?
Does this change how often it converges to a good model? For example, try the
following architecture:

- First hidden layer with 3 neurons.
- Second hidden layer with 3 neurons.
- Third hidden layer with 2 neurons.
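
To get a feel for how large this architecture is, the sketch below counts its trainable weights and biases. It assumes two input features (as in Task 1), fully connected layers, and a single output neuron; `layer_sizes` and `count_parameters` are hypothetical names used only for illustration.

```python
# Layer widths: 2 input features, hidden layers of 3, 3, and 2 neurons,
# and 1 output neuron (the input and output widths are assumptions).
layer_sizes = [2, 3, 3, 2, 1]


def count_parameters(sizes):
    """Count weights and biases in a fully connected network."""
    total = 0
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        total += n_in * n_out  # one weight per connection
        total += n_out         # one bias per neuron
    return total


total_params = count_parameters(layer_sizes)
print(total_params)  # 9 + 12 + 8 + 3 = 32
```

Under the same assumptions, a single hidden layer of 3 neurons (`[2, 3, 1]`) has 13 parameters, so the deeper architecture is only about 2.5 times larger.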

(Answers appear just below the exercise.)

#### Answer to Task 1

The Activation is set to `Linear`, so this model cannot learn
any nonlinearities. The loss is very high, and we say the model **underfits**
the data.
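
One way to see why linear activations cannot help is that stacking linear layers always collapses into a single linear model. The NumPy sketch below illustrates this with randomly chosen, hypothetical weights (it is not the model Playground trains): a hidden layer of 4 linear neurons followed by a linear output gives the same predictions as one combined weight matrix and bias.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights: 2 inputs -> 4 linear hidden neurons -> 1 linear output.
W1 = rng.normal(size=(2, 4))
b1 = rng.normal(size=4)
W2 = rng.normal(size=(4, 1))
b2 = rng.normal(size=1)


def two_layer_linear(x):
    hidden = x @ W1 + b1  # linear activation: nothing nonlinear is applied
    return hidden @ W2 + b2


# The same function written as a single linear model.
W = W1 @ W2
b = b1 @ W2 + b2


def single_linear(x):
    return x @ W + b


x = rng.normal(size=(5, 2))
same = np.allclose(two_layer_linear(x), single_linear(x))
print(same)  # True
```

Because the combined model is still a single linear function of the inputs, no choice of weights can produce a curved or piecewise decision boundary.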

#### Answer to Task 2

With a nonlinear activation function, the model can learn nonlinearities. However,
a single hidden layer with 2 neurons cannot reflect all the nonlinearities in
this data set, and will have high loss even without noise: it still
**underfits** the data. These exercises are nondeterministic, so some runs
will not learn an effective model, while other runs will do a pretty good job.
The best model may not have the shape you expect!

#### Answer to Task 3

Playground's nondeterministic nature shines through on this exercise. A
single hidden layer with 3 neurons is enough to model the data set (absent
noise), but not all runs will converge to a good model.

3 neurons are enough because the XOR function can be expressed as a
combination of 3 half-planes (ReLU activation). You can see this from looking
at the neuron images, which show the output of the individual neurons. In a
good model with 3 neurons and ReLU activation, there will be 1 image with an
almost vertical line, detecting $X_1$ being positive (or negative; the sign may
be switched), 1 image with an almost horizontal line, detecting the sign of
$X_2$, and 1 image with a diagonal line, detecting their interaction.
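
The three-line construction can be checked by hand. The sketch below uses hand-picked, hypothetical weights (one vertical, one horizontal, and one diagonal ReLU neuron) rather than anything Playground actually learns, and it assumes labels of +1 where the two features share a sign and -1 otherwise.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


# Hand-picked (hypothetical) weights; each row is one hidden neuron.
W_hidden = np.array([
    [1.0, 0.0],  # vertical boundary: active when x1 > 0
    [0.0, 1.0],  # horizontal boundary: active when x2 > 0
    [1.0, 1.0],  # diagonal boundary: active when x1 + x2 > 0
])
w_out = np.array([-1.0, -1.0, 1.0])
b_out = 0.1


def predict(points):
    hidden = relu(points @ W_hidden.T)
    return hidden @ w_out + b_out


points = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
labels = np.array([1, 1, -1, -1])

scores = predict(points)
predicted = np.where(scores > 0, 1, -1)
print(predicted)  # [ 1  1 -1 -1]
```

With these weights the score works out to `0.1` in the same-sign quadrants and `0.1 - min(|x1|, |x2|)` in the mixed-sign quadrants, so the only misclassified points are mixed-sign points within 0.1 of an axis.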

However, not all runs will converge to a good model. Some runs will do no
better than a model with 2 neurons, and you can see duplicate neurons in these
cases.

#### Answer to Task 4

A single hidden layer with 3 neurons can model the data, but there is no
redundancy, so on many runs it will effectively lose a neuron and not learn a
good model. A single layer with more than 3 neurons has more redundancy, and
thus is more likely to converge to a good model.

As we saw, a single hidden layer with only 2 neurons cannot model the data
well. If you try it, you can see that every shape in the output layer is
composed only of the lines from those two nodes. In this case, a deeper
network can model the data set better than the first hidden layer alone:
individual neurons in the second layer can model more complex shapes, like the
upper-right quadrant, by combining neurons in the first layer. Although adding
a second hidden layer helps, it might make more sense to add more nodes to the
first layer, so that more lines are part of the kit from which the second layer
builds its shapes.

However, a model with 1 neuron in the first hidden layer cannot learn a good
model no matter how deep it is. This is because the output of the first
layer only varies along one dimension (usually a diagonal line), which isn't
enough to model this data set well. Later layers can't compensate for this, no
matter how complex; information in the input data has been irrecoverably
lost.
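
A small numeric check makes the bottleneck concrete. In the sketch below, the single first-layer neuron uses hypothetical diagonal weights, and `deeper_layers` is an arbitrary stand-in for whatever the later layers compute. Two points from differently labeled quadrants produce the same first-layer output, so nothing downstream can tell them apart.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


# Hypothetical weights for the single first-layer neuron (a diagonal line).
w = np.array([1.0, 1.0])
b = 0.0


def first_layer(point):
    return relu(point @ w + b)


def deeper_layers(h):
    # Arbitrary stand-in for any stack of later layers: it only sees h.
    return np.tanh(3.0 * relu(2.0 * h - 1.0) - 0.5)


a = np.array([1.0, 1.0])   # both features positive
c = np.array([3.0, -1.0])  # features have opposite signs

h_a = first_layer(a)
h_c = first_layer(c)
print(h_a, h_c)  # 2.0 2.0
print(deeper_layers(h_a) == deeper_layers(h_c))  # True
```

Any other function substituted for `deeper_layers` gives the same result, because both points have already been mapped to the same single number.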

What if instead of trying to have a small network, we had lots of layers with
lots of neurons, for a simple problem like this? Well, as we've seen, the first
layer will have the ability to try lots of different line slopes. And the second
layer will have the ability to accumulate them into lots of different shapes,
with lots and lots of shapes on down through the subsequent layers.

By allowing the model to consider so many different shapes through so many
different hidden neurons, you've created enough space for the model to start
easily **overfitting** on the noise in the training set, allowing these
complex shapes to match the foibles of the training data rather than the
generalized ground truth. In this example, larger models can have complicated
boundaries to match the precise data points. In extreme cases, a large model
could learn an island around an individual point of noise, which is called
**memorizing** the data. By allowing the model to be so much larger, you'll
see that it actually often performs *worse* than the simpler model with
just enough neurons to solve the problem.