Guide for starting a new project

This section explains how to pick the following at the start of an ML project:

the model architecture
the optimizer
the batch size
the initial configuration

Assumptions

The advice in this section assumes the following:

You've already formulated the problem, and prepared your training data to some extent.
You've already set up a training and testing pipeline.
You've already selected and implemented metrics that are as representative as possible of what you plan to measure in the deployed environment.

Assuming you've met all the preceding prerequisites, you are now ready to devote time to the model architecture and training configuration.

Choose the model architecture

Let's start with the following definitions:

A model architecture is a system for producing predictions. A model architecture contains the framework for converting input data into predictions, but doesn't contain parameter values. For example, a neural network with three hidden layers of 10 nodes, 5 nodes, and 3 nodes, respectively, is a model architecture.
A model is a model architecture plus specific values for all the parameters. For example, a model consists of the neural network described in the definition of model architecture, plus specific values for the weights and bias of each node.
A model family is a template for constructing a model architecture given a set of hyperparameters.

Choosing the model architecture really means choosing a set of different models (one for each setting of the model hyperparameters).
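
The three definitions above can be made concrete with a small sketch. The names `ModelArchitecture`, `Model`, `mlp_family`, and `init_model` are hypothetical and chosen only for illustration. The hidden layer sizes 10, 5, and 3 come from the example above; the input size, output size, and initialization scheme are assumptions.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelArchitecture:
    """Layer sizes only: the framework for predictions, no parameter values."""
    layer_sizes: tuple  # (inputs, hidden..., outputs)


@dataclass
class Model:
    """An architecture plus specific values for every weight and bias."""
    architecture: ModelArchitecture
    weights: list  # weights[k][i][j]: layer k, input i -> node j
    biases: list   # biases[k][j]: layer k, node j


def mlp_family(num_inputs, hidden_sizes, num_outputs):
    """Model family: maps a set of hyperparameters to one architecture."""
    return ModelArchitecture((num_inputs, *hidden_sizes, num_outputs))


def init_model(arch, seed=0):
    """Turn an architecture into a model by choosing parameter values."""
    rng = random.Random(seed)
    sizes = arch.layer_sizes
    weights = [
        [[rng.gauss(0.0, 0.1) for _ in range(n_out)] for _ in range(n_in)]
        for n_in, n_out in zip(sizes, sizes[1:])
    ]
    biases = [[0.0] * n_out for n_out in sizes[1:]]
    return Model(arch, weights, biases)


# One setting of the family's hyperparameters gives one architecture.
arch = mlp_family(num_inputs=4, hidden_sizes=(10, 5, 3), num_outputs=1)
model = init_model(arch)
```

Calling `mlp_family` with different `hidden_sizes` yields different architectures from the same family, which is the sense in which choosing an architecture means choosing a set of models.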

When possible, try to find a documented codebase that addresses something as close as possible to the current problem. Then, reproduce that model as a starting point.

Choose the optimizer

No optimizer is the "best" across all types of machine learning problems and model architectures. Even just comparing the performance of optimizers is difficult. We recommend using well-established, popular optimizers, especially when starting a new project.

Ideally, choose the most popular optimizer for the type of problem you are working on. The following well-established optimizers are good candidates:

SGD with momentum. We recommend the Nesterov variant.
Adam and NAdam, which are more general than SGD with momentum. Note that Adam has four tunable arguments and they can all matter! See How should Adam's hyperparameters be tuned?.

Pay attention to all arguments of the chosen optimizer. Optimizers with more hyperparameters typically require more tuning effort. This is particularly painful in the beginning stages of a project when you are trying to find the best values of various other hyperparameters (for example, learning rate) while treating optimizer arguments as nuisances. Therefore, we recommend the following approach:

At the start of the project, pick an optimizer without many tunable hyperparameters. Here are two examples:
- SGD with fixed momentum.
- Adam with fixed Epsilon, Beta1, and Beta2.
In later stages of the project, switch to a more general optimizer that tunes more hyperparameters instead of fixing them to default values.
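
As a minimal sketch of "Adam with fixed Epsilon, Beta1, and Beta2", the following scalar implementation exposes only the learning rate. The function names and the state dictionary are hypothetical, and the fixed values (0.9, 0.999, 1e-8) are commonly used defaults assumed here for illustration, not values prescribed by this guide.

```python
import math

# Commonly used defaults, held fixed in this sketch (an assumption).
BETA1, BETA2, EPSILON = 0.9, 0.999, 1e-8


def adam_step(param, grad, state, learning_rate):
    """One Adam update for a scalar parameter; only learning_rate is tunable."""
    state["t"] += 1
    state["m"] = BETA1 * state["m"] + (1 - BETA1) * grad
    state["v"] = BETA2 * state["v"] + (1 - BETA2) * grad * grad
    m_hat = state["m"] / (1 - BETA1 ** state["t"])  # bias-corrected 1st moment
    v_hat = state["v"] / (1 - BETA2 ** state["t"])  # bias-corrected 2nd moment
    return param - learning_rate * m_hat / (math.sqrt(v_hat) + EPSILON)


def minimize_quadratic(learning_rate, steps=200, start=1.0):
    """Minimize the toy objective f(x) = x**2 and return the final x."""
    x, state = start, {"t": 0, "m": 0.0, "v": 0.0}
    for _ in range(steps):
        x = adam_step(x, 2.0 * x, state, learning_rate)  # gradient of x**2
    return x
```

With the other three arguments pinned, an early tuning study only has to search over `learning_rate`. A later, more general setup would promote `BETA1`, `BETA2`, and `EPSILON` to function arguments.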

Choose the batch size

Summary: The batch size governs the training speed; don't use batch size to directly tune the validation set performance.

Batch size heavily determines the training time and computing resource consumption. Increasing the batch size often decreases the training time, which:

Lets you tune hyperparameters more thoroughly within a fixed time interval, potentially producing a better final model.
Reduces the latency of the development cycle, allowing new ideas to be tested more frequently.

Increasing the batch size can decrease or increase resource consumption, or leave resource consumption unchanged.

Don't treat the batch size as a tunable hyperparameter for validation set performance. If all of the following conditions are met, model performance shouldn't depend on batch size:

All the optimizer hyperparameters are well-tuned.
Regularization is sufficient and well-tuned.
The number of training steps is sufficient.

The same final performance should be attainable using any batch size. (See Shallue et al. 2018 and Why shouldn't the batch size be tuned to directly improve validation set performance?)

Determine the feasible batch sizes and estimate training throughput

For a given model and optimizer, the available hardware typically supports a range of batch sizes. The limiting factor is usually accelerator memory. Unfortunately, it can be difficult to calculate which batch sizes will fit in memory without running, or at least compiling, the full training program. The easiest solution is usually to run training jobs at different batch sizes (for example, increasing powers of 2) for a small number of steps until one of the jobs exceeds the available memory. For each batch size, train long enough to get a reliable estimate of the training throughput:

training throughput = the number of examples processed per second

or, equivalently, the time per step:

time per step = batch size / training throughput

When the accelerators aren't yet saturated, if the batch size doubles, the training throughput should also double (or at least nearly double). Equivalently, the time per step should be constant (or at least nearly constant) as the batch size increases. If this is not the case, then the training pipeline has a bottleneck, such as I/O or synchronization between compute nodes. Consider diagnosing and correcting the bottleneck before proceeding.
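
The throughput check described above can be sketched as follows, assuming the time per step has already been measured for a sequence of doubling batch sizes. The function names, the `min_speedup` threshold of 1.5 for "nearly doubles", and the measurements are all hypothetical.

```python
def training_throughput(batch_size, time_per_step):
    """Examples processed per second."""
    return batch_size / time_per_step


def largest_scaling_batch_size(step_times, min_speedup=1.5):
    """Largest batch size whose throughput still grew by at least
    min_speedup relative to the next smaller measured batch size.

    step_times maps batch size -> measured seconds per step, and is
    assumed to contain batch sizes that double from one to the next.
    """
    sizes = sorted(step_times)
    best = sizes[0]
    for smaller, larger in zip(sizes, sizes[1:]):
        gain = (training_throughput(larger, step_times[larger])
                / training_throughput(smaller, step_times[smaller]))
        if gain < min_speedup:
            break  # throughput stopped scaling: saturation or a bottleneck
        best = larger
    return best


# Made-up measurements, for illustration only.
measured = {64: 0.10, 128: 0.10, 256: 0.11, 512: 0.21}
print(largest_scaling_batch_size(measured))  # 256
```

In this made-up data, the time per step roughly doubles between 256 and 512, so throughput barely improves. Whether that reflects saturated accelerators or a pipeline bottleneck still has to be diagnosed separately.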

If the training throughput increases only up to some maximum batch size, then only consider batch sizes up to that maximum batch size, even if the hardware supports a larger batch size. All benefits of using a larger batch size assume the training throughput increases. If it doesn't, fix the bottleneck or use the smaller batch size.

Gradient accumulation simulates a larger batch size than the hardware can support and therefore does not provide any throughput benefits. You should generally avoid gradient accumulation in applied work.

You might need to repeat these steps every time you change the model or the optimizer. For example, a different model architecture may allow a larger batch size to fit in memory.

Choose the batch size to minimize training time

Here is our definition of training time:

training time = (time per step) x (total number of steps)

You can often consider the time per step to be approximately constant for all feasible batch sizes. This is true when:

There is no overhead from parallel computations.
All training bottlenecks have been diagnosed and corrected. (See the previous section for how to identify training bottlenecks.)

In practice, there is usually at least some overhead from increasing the batch size.

As the batch size increases, the total number of steps needed to reach a fixed performance goal typically decreases, provided you retune all relevant hyperparameters when changing the batch size. (See Shallue et al. 2018.) For example, doubling the batch size might halve the total number of steps required. This relationship is called perfect scaling and should hold for all batch sizes up to a critical batch size.

Beyond the critical batch size, increasing the batch size produces diminishing returns. That is, increasing the batch size eventually stops reducing the number of training steps, although it never increases that number. Therefore, the batch size that minimizes training time is usually the largest batch size that still reduces the number of training steps required. This batch size depends on the dataset, model, and optimizer, and it is an open problem how to calculate it other than finding it experimentally for every new problem.
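
To illustrate the training time formula, the sketch below picks the batch size with the smallest training time from two sets of measurements. All numbers are invented for illustration, and the function names are hypothetical. The step counts halve up to a batch size of 256 (perfect scaling) and then flatten out.

```python
def training_time(time_per_step, total_steps):
    """training time = (time per step) x (total number of steps)."""
    return time_per_step * total_steps


def fastest_batch_size(time_per_step, steps_to_target):
    """Return (best batch size, training time per batch size)."""
    times = {
        b: training_time(time_per_step[b], steps_to_target[b])
        for b in steps_to_target
    }
    return min(times, key=times.get), times


# Invented measurements: seconds per step and steps needed to reach a
# fixed performance goal after retuning at each batch size.
time_per_step = {64: 0.10, 128: 0.10, 256: 0.11, 512: 0.13, 1024: 0.25}
steps_to_target = {64: 8000, 128: 4000, 256: 2000, 512: 1200, 1024: 1100}

best, times = fastest_batch_size(time_per_step, steps_to_target)
print(best)  # 512
```

In this example, going from 512 to 1024 saves only 100 steps but nearly doubles the time per step, so the training time goes up.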

When comparing batch sizes, beware of the distinction between the following:

An example budget or epoch budget—running all experiments while fixing the number of training example presentations.
A step budget—running all experiments with a fixed number of training steps.

Comparing batch sizes with an epoch budget only probes the perfect scaling regime, even when larger batch sizes might still provide a meaningful speedup by reducing the number of training steps required. Often, the largest batch size supported by the available hardware is smaller than the critical batch size. Therefore, a good rule of thumb (without running any experiments) is to use the largest batch size possible. There is no point in using a larger batch size if it ends up increasing the training time.
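
The difference between the two budgets comes down to simple arithmetic, sketched below with hypothetical function names and made-up dataset sizes. Under an epoch budget, doubling the batch size halves the number of steps by construction, so a larger batch size can only match the smaller one if scaling is perfect.

```python
def steps_under_epoch_budget(num_epochs, dataset_size, batch_size):
    """Training steps when the number of example presentations is fixed."""
    return (num_epochs * dataset_size) // batch_size


def examples_under_step_budget(num_steps, batch_size):
    """Example presentations when the number of training steps is fixed."""
    return num_steps * batch_size


# Epoch budget: 10 epochs over 51,200 examples (made-up numbers).
print(steps_under_epoch_budget(10, 51_200, 128))  # 4000
print(steps_under_epoch_budget(10, 51_200, 256))  # 2000

# Step budget: 4,000 steps at either batch size.
print(examples_under_step_budget(4000, 128))  # 512000
print(examples_under_step_budget(4000, 256))  # 1024000
```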

Choose the batch size to minimize resource consumption

There are two types of resource costs associated with increasing the batch size:

Upfront costs. For example, purchasing new hardware or rewriting the training pipeline to implement multi-GPU / multi-TPU training.
Usage costs. For example, billing against the team's resource budgets, billing from a cloud provider, electricity / maintenance costs.

If there are significant upfront costs to increasing the batch size, it might be better to defer increasing the batch size until the project has matured and it is easier to assess the cost-benefit tradeoff. Implementing multi-host parallel training programs can introduce bugs and subtle issues, so it is probably better to start off with a simpler pipeline anyway. On the other hand, a large speedup in training time might be very beneficial early in the process when a lot of tuning experiments are needed.

We refer to the total usage cost (which may include multiple different kinds of costs) as the resource consumption, calculated as follows:

resource consumption = resource consumption per step x total number of steps

Increasing the batch size usually reduces the total number of steps. Whether the resource consumption increases or decreases depends on how the consumption per step changes, which depends on batch size as follows:

Increasing the batch size might decrease resource consumption. For example, if each step with the larger batch size can be run on the same hardware as the smaller batch size (with only a small increase in time per step), then any increase in the resource consumption per step might be outweighed by the decrease in the number of steps.
Increasing the batch size might not change resource consumption. For example, if doubling the batch size halves the number of steps required and doubles the number of GPUs used, the total consumption (in terms of GPU-hours) does not change.
Increasing the batch size might increase the resource consumption. For example, if increasing the batch size requires upgraded hardware, the increase in consumption per step might outweigh the reduction in the number of steps.
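
The three cases above can be checked with a short calculation. This sketch measures resource consumption in accelerator-hours, which is one possible choice of unit. The function name and all the numbers are hypothetical.

```python
def resource_consumption(num_accelerators, time_per_step_s, total_steps):
    """resource consumption = consumption per step x total number of steps,
    expressed here in accelerator-hours."""
    per_step_hours = num_accelerators * time_per_step_s / 3600.0
    return per_step_hours * total_steps


# Baseline: 4 GPUs, 0.5 s per step, 10,000 steps.
baseline = resource_consumption(4, 0.5, 10_000)

# Doubled batch size, half the steps, in three hypothetical scenarios.
same_hardware = resource_consumption(4, 0.6, 5_000)   # slightly slower steps
twice_the_gpus = resource_consumption(8, 0.5, 5_000)  # same time per step
upgraded = resource_consumption(16, 0.5, 5_000)       # much more hardware

print(same_hardware < baseline)   # True: consumption decreases
print(twice_the_gpus == baseline) # True: consumption is unchanged
print(upgraded > baseline)        # True: consumption increases
```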

Changing the batch size requires re-tuning most hyperparameters

The optimal values of most hyperparameters are sensitive to the batch size. Therefore, changing the batch size typically requires starting the tuning process all over again. The hyperparameters that interact most strongly with the batch size, and therefore are most important to tune separately for each batch size, are as follows:

The optimizer hyperparameters (for example, learning rate and momentum)
The regularization hyperparameters

Note this when choosing the batch size at the start of a project. If you need to switch to a different batch size later on, it might be difficult, time consuming, and expensive to retune the other hyperparameters for the new batch size.
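
One way to picture the retuning cost is a separate search per batch size, sketched below. `tune_per_batch_size` and `toy_evaluate` are hypothetical names. The toy objective stands in for a full training run and simply pretends that the best learning rate grows with the batch size; that is an assumption made so the example runs, not a rule from this guide.

```python
import math


def tune_per_batch_size(batch_sizes, learning_rates, evaluate):
    """Run an independent learning-rate search for every batch size.

    evaluate(batch_size, learning_rate) returns a validation error,
    where lower is better.
    """
    best = {}
    for batch_size in batch_sizes:
        best[batch_size] = min(
            learning_rates, key=lambda lr: evaluate(batch_size, lr)
        )
    return best


def toy_evaluate(batch_size, learning_rate):
    """Toy stand-in for training: error is the log-distance to a made-up
    optimum that depends on the batch size."""
    optimum = 0.001 * batch_size / 64
    return abs(math.log(learning_rate / optimum))


grid = [0.001, 0.002, 0.004, 0.008]
print(tune_per_batch_size([64, 256], grid, toy_evaluate))
```

The point of the sketch is the cost structure: every additional batch size multiplies the number of training runs, and a real study would also search over momentum and regularization hyperparameters.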

How batch norm interacts with the batch size

Batch norm is complicated and, in general, should use a different batch size than the gradient computation to compute statistics. See Batch normalization implementation details for a detailed discussion.

Choose the initial configuration

The first stage in hyperparameter tuning is determining starting points for the following:

the model configuration (e.g. number of layers)
the optimizer hyperparameters (e.g. learning rate)
the number of training steps

Determining this initial configuration requires some manually configured training runs and trial-and-error.

Our guiding principle is as follows:

Find a simple, relatively fast, relatively low-resource-consumption configuration that obtains a reasonable performance.

where:

Simple means avoiding unnecessary pipeline features, such as special regularizers or architectural tricks. For example, a pipeline without dropout regularization (or with dropout regularization disabled) is simpler than one with dropout regularization.
Reasonable performance depends on the problem, but at minimum, a reasonable trained model performs much better than random chance on the validation set.

Choosing an initial configuration that is fast and consumes minimal resources makes hyperparameter tuning much more efficient. For example, start with a smaller model.
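
As a rough sketch, an initial configuration in this spirit might be recorded as a small, explicit object. The class name, every field name, and every default value below are hypothetical placeholders, not recommended settings.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class InitialConfig:
    # Model configuration: start small so each run is fast and cheap.
    num_layers: int = 2
    hidden_units: int = 64
    dropout_rate: float = 0.0  # dropout disabled keeps the pipeline simple
    # Optimizer hyperparameters: few tunable values at the start.
    optimizer: str = "sgd_momentum"
    momentum: float = 0.9      # held fixed in early experiments
    learning_rate: float = 0.1
    # Training budget.
    batch_size: int = 128
    num_train_steps: int = 10_000


baseline = InitialConfig()

# Later experiments derive variants instead of editing the baseline.
larger = replace(baseline, num_layers=4, hidden_units=256)
```

Keeping the baseline immutable and deriving variants with `replace` makes it easy to see exactly which settings an experiment changed.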

Choosing the number of training steps involves balancing the following tension:

Training for more steps can improve performance and simplifies hyperparameter tuning. (For more details, see Shallue et al. 2018).
Conversely, training for fewer steps means that each training run is faster and uses fewer resources, boosting tuning efficiency by reducing the time between cycles and allowing you to run more experiments in parallel. Moreover, if you choose an unnecessarily large step budget at the start of the project, it might be hard to change it later in the project; for example, once you've tuned the learning rate schedule for that number of steps.
