Data Split Example

After collecting your data and sampling where needed, the next step is to split your data into training sets, validation sets, and testing sets.

When Random Splitting isn't the Best Approach

While random splitting is the best approach for many ML problems, it isn't always the right solution. For example, consider data sets in which the examples naturally fall into clusters of similar examples.

Suppose you want your model to classify the topic from the text of a news article. Why would a random split be problematic?

Figure 1. News stories are clustered: four separate clusters of articles (labeled "Story 1" through "Story 4") appear on a timeline.

News stories appear in clusters: multiple articles about the same story are published around the same time. If we split the data randomly, the training set and the test set will therefore likely contain articles about the same stories. The model would be evaluated partly on stories it has effectively already seen, inflating its measured performance. At serving time every incoming story is new, so a random split causes skew between evaluation and reality.
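This leakage is easy to demonstrate. The sketch below uses a hypothetical toy data set of `(story_id, article_id)` pairs (all names are made up for illustration) and measures how often a random split puts articles from the same story into both sets:

```python
import random

# Hypothetical articles tagged with the story (cluster) they belong to.
articles = [("story_1", "article_a"), ("story_1", "article_b"),
            ("story_2", "article_c"), ("story_2", "article_d"),
            ("story_3", "article_e"), ("story_3", "article_f")]

def leaked_stories(seed):
    """Randomly split the articles 4/2 and return the set of stories
    that appear in BOTH the training set and the test set."""
    shuffled = articles[:]
    random.Random(seed).shuffle(shuffled)
    train, test = shuffled[:4], shuffled[4:]
    return {story for story, _ in train} & {story for story, _ in test}

# Across many random splits, some story almost always leaks across sets.
leak_rate = sum(bool(leaked_stories(seed)) for seed in range(100)) / 100
print(f"fraction of splits with leakage: {leak_rate:.2f}")
```

Even on this tiny example, the large majority of random splits place at least one story on both sides of the boundary.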

Figure 2. A random split divides each cluster across the training and test sets, causing skew.

A simple fix is to split the data on the time the story was published, perhaps by publication day. Stories from the same day then land in the same split.
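A minimal sketch of such a time-based split, assuming each example is a hypothetical `(publish_date, story_id)` pair:

```python
from datetime import date

# Hypothetical articles: (publish_date, story_id) pairs.
articles = [
    (date(2024, 4, 3), "story_1"),
    (date(2024, 4, 3), "story_1"),
    (date(2024, 4, 10), "story_2"),
    (date(2024, 4, 24), "story_3"),
    (date(2024, 4, 25), "story_4"),
]

def split_by_date(examples, cutoff):
    """Train on everything published on or before `cutoff`; test on
    everything after. Articles from the same day stay together, so
    same-story clusters rarely straddle the split."""
    train = [ex for ex in examples if ex[0] <= cutoff]
    test = [ex for ex in examples if ex[0] > cutoff]
    return train, test

train, test = split_by_date(articles, cutoff=date(2024, 4, 15))
```

Because the split boundary is a date rather than a random draw, no story from early April can appear in the test set.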

Figure 3. Splitting on time allows the clusters to mostly end up in the same set.

With tens of thousands or more news stories, a small percentage of clusters will still be divided across the day boundary. That's okay: in reality those stories really were split across two days of the news cycle. Alternatively, you can discard data within a certain distance of the cutoff to guarantee no overlap at all. For example, you could train on stories from the month of April and test on stories from the second week of May, with the intervening week acting as a gap that prevents overlap.
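The gap idea can be sketched as follows, again assuming hypothetical `(publish_date, story_id)` examples; articles published inside the gap are simply discarded:

```python
from datetime import date, timedelta

def split_with_gap(examples, train_end, gap_days=7):
    """Train on articles published up to `train_end`, drop everything
    published during the following `gap_days`, and test on the rest,
    so no single news cycle can span both sets."""
    test_start = train_end + timedelta(days=gap_days)
    train = [ex for ex in examples if ex[0] <= train_end]
    test = [ex for ex in examples if ex[0] >= test_start]
    return train, test

# Train on April, skip the first week of May, test from May 7 onward.
articles = [(date(2024, 4, 30), "story_a"),
            (date(2024, 5, 3), "story_b"),   # falls in the gap: discarded
            (date(2024, 5, 9), "story_c")]
train, test = split_with_gap(articles, train_end=date(2024, 4, 30))
```

The gap width is a trade-off: wide enough that no cluster can bridge it, but narrow enough that you aren't throwing away too much data.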