Appendix: Batch Training

  • Keras' fit_generator function enables training on very large datasets that exceed memory capacity by processing data in batches.

  • Batching applies data transformations to only a small portion of the dataset at a time, which keeps memory usage manageable for large datasets such as DBPedia, Amazon reviews, AG News, and Yelp reviews.

  • The provided _data_generator function demonstrates how to create batches of data for use with fit_generator, yielding feature and label data in manageable chunks.

  • When training with fit_generator, steps_per_epoch and validation_steps must be set to the number of batches needed to cover the entire training and validation datasets, respectively, in one epoch.

Very large datasets may not fit in the memory allocated to your process. In the previous steps, we set up a pipeline that brings the entire dataset into memory, prepares the data, and passes the working set to the training function. Instead, Keras provides an alternative training function (fit_generator) that pulls the data in batches. This allows us to apply the transformations in the data pipeline to only a small part of the data (a multiple of batch_size) at a time. During our experiments, we used batching (code in GitHub) for datasets such as DBPedia, Amazon reviews, AG News, and Yelp reviews.
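
For illustration, suppose the raw texts are vectorized with an sklearn TfidfVectorizer (a hypothetical setup; the vectorizer, texts, and labels names below are not part of the pipeline above). A generator can then call transform on one slice of texts at a time, so only batch_size documents are ever vectorized in memory at once:

import math

import numpy as np


def _lazy_vectorizing_generator(texts, labels, vectorizer, batch_size):
    """Sketch of a generator that vectorizes one batch of raw texts at a time.

    Assumes `vectorizer` has already been fit (e.g., an sklearn
    TfidfVectorizer), so each batch only pays the cost of `transform`.
    """
    num_samples = len(texts)
    num_batches = math.ceil(num_samples / batch_size)
    while True:
        for i in range(num_batches):
            start_idx = i * batch_size
            end_idx = min(start_idx + batch_size, num_samples)
            # Vectorize just this slice; the full feature matrix is never built.
            x_batch = vectorizer.transform(texts[start_idx:end_idx]).toarray()
            y_batch = np.asarray(labels[start_idx:end_idx])
            yield x_batch, y_batch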

The following code illustrates how to generate data batches and feed them to fit_generator.

def _data_generator(x, y, num_features, batch_size):
    """Generates batches of vectorized texts for training/validation.

    # Arguments
        x: np.matrix, feature matrix.
        y: np.ndarray, labels.
        num_features: int, number of features.
        batch_size: int, number of samples per batch.

    # Yields
        Feature and label data, one batch at a time.
    """
    num_samples = x.shape[0]
    num_batches = num_samples // batch_size
    if num_samples % batch_size:
        num_batches += 1

    while True:
        # Loop indefinitely; Keras draws steps_per_epoch batches per epoch.
        for i in range(num_batches):
            start_idx = i * batch_size
            end_idx = (i + 1) * batch_size
            if end_idx > num_samples:
                end_idx = num_samples  # Final batch may be smaller than batch_size.
            x_batch = x[start_idx:end_idx]
            y_batch = y[start_idx:end_idx]
            yield x_batch, y_batch

# Create training and validation generators.
training_generator = _data_generator(
    x_train, train_labels, num_features, batch_size)
validation_generator = _data_generator(
    x_val, val_labels, num_features, batch_size)

# Get number of training steps. This indicates the number of batches needed
# to cover all samples in one epoch.
steps_per_epoch = x_train.shape[0] // batch_size
if x_train.shape[0] % batch_size:
    steps_per_epoch += 1

# Get number of validation steps.
validation_steps = x_val.shape[0] // batch_size
if x_val.shape[0] % batch_size:
    validation_steps += 1

# Train and validate model.
history = model.fit_generator(
    generator=training_generator,
    steps_per_epoch=steps_per_epoch,
    validation_data=validation_generator,
    validation_steps=validation_steps,
    callbacks=callbacks,
    epochs=epochs,
    verbose=2)  # Logs once per epoch.
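
As a quick sanity check, the generator can be exercised on its own. The x_demo and y_demo arrays below are hypothetical stand-ins, sized only to show how the final, smaller batch is produced:

import numpy as np

# Hypothetical stand-ins for the vectorized features and labels.
num_features = 20000
batch_size = 128
x_demo = np.random.rand(1000, num_features).astype('float32')
y_demo = np.random.randint(0, 4, size=1000)

gen = _data_generator(x_demo, y_demo, num_features, batch_size)
x_batch, y_batch = next(gen)
print(x_batch.shape, y_batch.shape)  # (128, 20000) (128,)

# 1000 = 7 * 128 + 104, so the eighth batch of each pass is smaller.
for _ in range(7):
    x_batch, y_batch = next(gen)
print(x_batch.shape)  # (104, 20000)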