Determine the number of steps for each training run

There are two types of training workloads:

  • compute-bound
  • not compute-bound

Compute-bound training is limited by how much time you can spend on training, not by how much training data you have or by some other factor. In other words, the "optimal" training time is always "as long as you can afford." If you can somehow train longer or more efficiently, training loss should drop. (With proper tuning, validation loss should also drop.)

Speeding up compute-bound training is equivalent to improving training. That said, just because a workload is compute-limited doesn't mean training longer or faster is the only way to improve results.

When training is not compute-bound, you can afford to train as long as you would like. However, training a model longer might not help much or might even cause overfitting. When training is not compute-bound:

  • You can train to very low training loss, to the point where additional training might slightly reduce the training loss but does not meaningfully reduce the validation loss.
  • You can tune more easily, especially when tuning learning rate decay schedules, since they have a particularly strong interaction with the training budget. In contrast, getting a low training loss on compute-bound training might require a learning rate decay schedule tuned to perfection.

Regardless of whether a given workload is compute-bound or not, methods that increase the variance of the gradients (across batches) typically slow training progress, and thus may increase the number of training steps required to reach a particular validation loss. Any of the following can cause a high gradient variance:

  • Using a smaller batch size.
  • Adding data augmentation.
  • Adding some types of regularization (for example, dropout regularization).
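
For intuition, gradient variance across batches can be estimated empirically by computing per-batch gradients at a fixed set of parameters and measuring how much they disagree. The following is a minimal sketch, assuming a hypothetical compute_gradient(params, batch) function that returns the gradient as a flat array:

```python
import numpy as np

def estimate_gradient_variance(params, batches, compute_gradient):
    """Estimate gradient variance across batches at fixed parameters.

    compute_gradient(params, batch) is assumed (hypothetically) to return
    the gradient as a flat NumPy array.
    """
    grads = np.stack([compute_gradient(params, b) for b in batches])
    mean_grad = grads.mean(axis=0)
    # Trace of the empirical covariance: average squared deviation from
    # the mean gradient, summed over parameters.
    return ((grads - mean_grad) ** 2).sum(axis=1).mean()
```

Smaller batch sizes, data augmentation, and dropout all tend to increase this quantity, which is why they can increase the number of steps needed to reach a given validation loss.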

Decide how long to train when training is not compute-bound

Your Goal: Train long enough for the model to reach the best possible result without wasting training steps.

Your main goal is to ensure that you train long enough for the model to reach the best possible result without wasting training steps. When in doubt, err on the side of training longer. Your evaluation metrics (for example, precision, recall, AUC, or F1) should never degrade when training longer, assuming you properly use retrospective checkpoint selection and you checkpoint frequently enough.

Never tune the max_train_steps number in a study. Instead, pick a value and use that same value for all trials. From these trials, plot the training step that retrospective checkpoint selection finds in order to refine the choice of max_train_steps.
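
As an illustration of this kind of analysis, here is a minimal sketch (with made-up numbers) assuming each trial's best checkpoint step has been collected into a list:

```python
import numpy as np

def best_step_fractions(best_steps, max_train_steps):
    """Express each trial's best checkpoint step as a fraction of the budget."""
    return np.asarray(best_steps) / max_train_steps

# Example: best checkpoint step found in each of five trials (hypothetical values).
fractions = best_step_fractions([3_000, 4_500, 2_800, 3_900, 5_200],
                                max_train_steps=50_000)
print(f"best-step fractions: {np.round(fractions, 2)}")
print(f"median: {np.median(fractions):.2f}")  # ~0.08 here, i.e. within the first 10%
```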

For example, if the best step is always during the first 10% of training, then the maximum number of steps is way too high. Alternatively, if the best step is consistently in the last 25% of training, you might benefit from training longer and retuning the decay schedule. The ideal number of training steps can change when the architecture or data changes (for example, adding data augmentation). The next section describes how to pick an initial candidate value for max_train_steps based on the number of steps necessary to "perfectly fit" the training set using a constant learning rate.

It may be possible to decrease max_train_steps if the training process improves somehow; for example, with a better-tuned optimizer or a better-tuned learning rate schedule.

Algorithm for picking an initial candidate for max_train_steps using a learning rate sweep

You can pick an initial candidate for max_train_steps with a learning rate sweep algorithm. The following algorithm assumes it is possible to not only "perfectly" fit the training set, but also to do so using a constant learning rate schedule.

  1. If it is possible to perfectly fit the entire training set, then there must exist a configuration (with some value of max_train_steps) that perfectly fits the training set. Find any such configuration and use its value of max_train_steps as a starting point N.
  2. Run a constant learning rate sweep (that is, grid search the learning rate) without data augmentation and without regularization where each trial trains for N steps. The number of steps required for the fastest trial in the learning rate sweep to reach perfect training performance should be your initial guess for max_train_steps.

NOTE: Bad search spaces can lead to self-deception. For example, if all the learning rates in a study are too small, you might incorrectly conclude that a very large value of max_train_steps is necessary. At a minimum, check that the optimal learning rate in the study is not at the boundary of the search space.
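
The following is a minimal sketch of step 2, assuming a hypothetical train(learning_rate, max_steps) function that runs one constant-learning-rate trial (no augmentation, no regularization) and returns the first step at which it perfectly fit the training set, or None if it never did:

```python
def initial_max_train_steps(train, learning_rates, n_steps):
    """Pick an initial candidate for max_train_steps from a constant-LR sweep."""
    steps_to_fit = []
    for lr in learning_rates:
        # train() is a hypothetical helper; it returns the first step at which
        # the trial reached perfect training performance, or None.
        step = train(learning_rate=lr, max_steps=n_steps)
        if step is not None:
            steps_to_fit.append(step)
    if not steps_to_fit:
        raise ValueError(
            "No trial fit the training set; widen the sweep or increase n_steps.")
    # The fastest trial sets the initial guess for max_train_steps.
    return min(steps_to_fit)
```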

Decide how long to train when training is compute-bound

In some cases, training loss keeps improving indefinitely, so your patience and your computational resources become the limiting factors. But should you train as long as you can afford? Not necessarily. Consider the following:

  • You might be able to tune more effectively by running a larger number of shorter experiments, reserving the longest "production length" runs for the models you hope to launch.
  • As the training time for trials approaches your patience limit, tuning experiments become more relevant for your potential launch candidates, but you can complete fewer of them.
  • You can probably answer many questions while only training for ~10% of the production length. However, your conclusions at this time limit might not apply to experiments at 20% of the production length, let alone 100%.

Tuning during multiple rounds with increasing, per-trial training step limits is a sensible approach. You can run as many rounds as you want, but usually 1-3 rounds are the most practical. Essentially, try to obtain as much understanding of the problem as possible using trials with a very quick turnaround time, trading off the following:

  • Tuning thoroughness.
  • Relevance to the final, longest runs.

Once a given per-trial time limit has generated useful insights, increase the training time and continue tuning, double-checking your conclusions from the shorter runs as needed. As a starting point, we recommend two rounds of tuning:

  • Round 1: Shorter duration runs to find good model and optimizer hyperparameters.
  • Round 2: Very few long duration runs on good hyperparameter points to get the final model.

The biggest question going from Round 1 to Round 2 is:

How do you adjust learning rate decay schedules?

One common pitfall when adjusting learning rate schedules between rounds is using all the extra training steps with a learning rate that is too small.

Round 1: a lot of short training runs

Unfortunately, there is no guarantee that good hyperparameters found in short, incomplete training are still good choices when you significantly increase training length. However, for some hyperparameters, the good choices are often correlated enough for Round 1 to be useful. What hyperparameter values found in shorter runs transfer successfully to longer training runs? We don't know; we need more research. But based on what we know so far, here are our suspicions in decreasing probability of transferring:

  • Very likely to transfer. Early training instability can be resolved in the first round of tuning using a smaller number of training steps. The following hyperparameters are the likeliest to transfer:
    • Warmup length
    • Initialization
  • Likely to transfer. A dramatic win in the model architecture usually transfers, but many counterexamples are probable.
  • Might transfer. The following hyperparameters might transfer:
    • The optimization algorithm and its hyperparameters would "loosely" transfer.
    • Data augmentation.
    • Regularization. If it isn't possible to perfectly fit the training set, the model might be in a regime where regularization is unlikely to help very much.
  • Unlikely to transfer. The learning rate schedule is unlikely to transfer perfectly. Training Compute-Optimal Large Language Models suggests that even the decay schedule transfers, but we don't believe this is true in general. For example, tuning sqrt decay on a small number of training steps and then extending to a large number causes the majority of training to occur at learning rates that are far too small. You can likely do "good enough" with most schedules in the limit of extreme training budget, but you will likely see noticeable performance improvements if the schedule is tuned. Understanding Short-Horizon Bias in Stochastic Meta-Optimization describes the dangers of trying to pick learning rates myopically. (A small numerical illustration of the sqrt decay issue follows this list.)
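
To make the sqrt decay point concrete, here is a small illustration with purely hypothetical numbers of what happens when a schedule tuned on a short horizon is stretched to a much longer one:

```python
import numpy as np

def sqrt_decay(step, base_lr=0.1):
    """Simple inverse-square-root decay: lr = base_lr / sqrt(step + 1)."""
    return base_lr / np.sqrt(step + 1)

short_horizon = 10_000   # number of steps the schedule was tuned on
long_horizon = 100_000   # extended production-length run

lr_at_short_end = sqrt_decay(short_horizon)
lrs_long = sqrt_decay(np.arange(long_horizon))
frac_below = (lrs_long < lr_at_short_end).mean()
print(f"lr at end of tuned horizon: {lr_at_short_end:.5f}")
print(f"fraction of the long run spent below that lr: {frac_below:.0%}")
# Roughly 90% of the extended run uses learning rates smaller than anything
# the short tuning run ever saw, which is why the schedule rarely transfers.
```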

Round 2: fewer runs, but of longer duration

Run the best hyperparameter configuration from Round 1.

Speculation: 🤖 Use the extra steps to extend the period of training at a high learning rate. For example, if you are using a linear schedule, then keep the length of the decay fixed from Round 1 and extend the period of constant lr in the beginning. For cosine decay, keep the base lr from Round 1 and extend max_train_steps as described in Training Compute-Optimal Large Language Models.
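
The following is a minimal sketch of both adjustments, written as hypothetical schedule functions rather than any particular library's API:

```python
import math

def round2_linear_schedule(step, peak_lr, round1_decay_steps, extra_steps):
    """Linear decay: keep the Round 1 decay length fixed and spend the extra
    Round 2 steps on a longer constant-lr phase at the beginning."""
    if step < extra_steps:
        return peak_lr
    frac = (step - extra_steps) / round1_decay_steps
    return peak_lr * max(0.0, 1.0 - frac)

def round2_cosine_schedule(step, base_lr, round2_max_train_steps):
    """Cosine decay: keep the Round 1 base lr and simply stretch the
    schedule over the new, larger max_train_steps."""
    frac = min(step / round2_max_train_steps, 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * frac))
```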

Additional training rounds might make sense for teams with all of the following:

  • Very mature modeling and tuning pipelines
  • Very long and expensive production training runs

However, additional rounds are often unproductive.

We've already described how to transfer from Round 1 to Round 2. If you don't care about analysis time and if making efficient use of computing resources is your overriding concern, then we recommend exponentially increasing the length of training runs (and thus the end-to-end time to complete a study) over many different rounds of tuning:

  • At each round, systematically ensure that your choices continue to provide good results.
  • Put new ideas through a pipeline that progressively derisks them using increasingly long-running experiments from Step i to Step i+1.