Training Neural Networks

Backpropagation is the most common training algorithm for neural networks. It makes gradient descent feasible for multi-layer neural networks. TensorFlow handles backpropagation automatically, so you don't need a deep understanding of the algorithm. To get a sense of how it works, walk through the following: Backpropagation algorithm visual explanation. As you scroll through the preceding explanation, note the following:

  • How data flows through the graph.
  • How dynamic programming lets us avoid computing exponentially many paths through the graph. Here "dynamic programming" just means recording intermediate results on the forward and backward passes (see the sketch below).
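To make that dynamic-programming idea concrete, here is a minimal sketch in plain NumPy of a one-hidden-layer network; the layer sizes and loss are illustrative choices, not anything prescribed above. The forward pass caches its intermediate results, and the backward pass reuses those caches instead of recomputing every path through the graph.

```python
import numpy as np

# Tiny one-hidden-layer network: x -> h = relu(W1 x + b1) -> y_hat = W2 h + b2.
# Forward pass caches intermediates; backward pass reuses them instead of
# re-deriving every path through the graph (the "dynamic programming" idea).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

def forward(x):
    z1 = W1 @ x + b1           # pre-activation, cached for the backward pass
    h = np.maximum(z1, 0.0)    # ReLU activation, cached
    y_hat = W2 @ h + b2
    return y_hat, (x, z1, h, y_hat)

def backward(cache, y):
    x, z1, h, y_hat = cache            # reuse cached forward results
    d_y = 2.0 * (y_hat - y)            # d(loss)/d(y_hat) for squared error
    dW2 = np.outer(d_y, h)
    db2 = d_y
    d_h = W2.T @ d_y                   # propagate the gradient to the hidden layer
    d_z1 = d_h * (z1 > 0)              # ReLU gradient uses the cached z1
    dW1 = np.outer(d_z1, x)
    db1 = d_z1
    return dW1, db1, dW2, db2

y_hat, cache = forward(np.array([1.0, -2.0, 0.5]))
grads = backward(cache, y=np.array([3.0]))
```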

Training Neural Nets

  • Gradients are important
    • If it's differentiable, we can probably learn on it
  • Gradients can vanish
    • Each additional layer can successively reduce signal vs. noise
    • ReLUs are useful here
  • Gradients can explode
    • Learning rates are important here
    • Batch normalization (useful knob) can help (see the model sketch after this list)
  • ReLU layers can die
    • Keep calm and lower your learning rates
  • We'd like our features to have reasonable scales
    • Roughly zero-centered, [-1, 1] range often works well
    • Helps gradient descent converge and avoids the NaN trap
    • Avoiding outlier values can also help
  • Can use a few standard methods (see the scaling sketch after this list):
    • Linear scaling
    • Hard cap (clipping) to max, min
    • Log scaling
  • Dropout: Another form of regularization, useful for NNs (see the dropout sketch after this list)
  • Works by randomly "dropping out" units in a network for a single gradient step
    • There's a connection to ensemble models here
  • The more you drop out, the stronger the regularization
    • 0.0 = no dropout regularization
    • 1.0 = drop everything out; the model learns nothing
    • Intermediate values are more useful
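The gradient-related points above can be illustrated with a minimal tf.keras sketch; the layer sizes, learning rate, and clipping threshold are illustrative assumptions, not tuned recommendations. ReLU activations help against vanishing gradients, batch normalization is the "useful knob", and a modest learning rate plus gradient clipping guards against exploding gradients and dying ReLUs.

```python
import tensorflow as tf

# Illustrative model: ReLU activations (vanishing gradients), batch
# normalization (useful knob), modest learning rate and gradient clipping
# (exploding gradients, dying ReLUs). Sizes and hyperparameters are made up.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(1),
])

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0),
    loss="mse",
)
```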
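The three standard scaling methods might look like this in NumPy; the raw values and the clipping bounds are made up for illustration.

```python
import numpy as np

values = np.array([1.0, 3.0, 10.0, 250.0, 10_000.0])  # illustrative raw feature

# Linear scaling to roughly [-1, 1] using the observed min and max.
lo, hi = values.min(), values.max()
linear_scaled = 2.0 * (values - lo) / (hi - lo) - 1.0

# Hard cap (clipping): squash outliers into a fixed [min, max] range.
clipped = np.clip(values, a_min=0.0, a_max=100.0)

# Log scaling: compresses long-tailed distributions.
log_scaled = np.log1p(values)
```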
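A minimal dropout sketch, again assuming tf.keras; the rate of 0.2 is an illustrative intermediate value, not a recommendation.

```python
import tensorflow as tf

# Dropout rate = fraction of units randomly dropped on each gradient step:
# 0.0 disables dropout, values near 1.0 drop almost everything and the
# network learns nothing. Intermediate rates usually work best.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(rate=0.2),   # active only during training
    tf.keras.layers.Dense(1),
])
```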