
# ML Systems in the Real World: Literature

In this lesson, you'll debug a real-world ML problem* related to 18th century literature.

# Real World Example: 18th Century Literature

• Professor of 18th Century Literature wanted to predict the political affiliation of authors based only on the "mind metaphors" the author used.
• Team of researchers made a big labeled data set with many authors' works, sentence by sentence, and split into train/validation/test sets.
• The trained model performed almost perfectly on test data, but the researchers felt the results were suspiciously accurate. What might have gone wrong?

Why do you think test accuracy was suspiciously high? See if you can figure out the problem before reading on.

• Data Split A: Researchers put some of each author's examples in the training set, some in the validation set, and some in the test set.
• Data Split B: Researchers put all of each author's examples in a single set. For example, all of Richardson's examples might be in the training set, while all of Swift's examples might be in the validation set.
• Results: The model trained on Data Split A had much higher accuracy than the model trained on Data Split B.
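The leakage behind the two splits can be sketched with toy data. This is an illustrative example only; the author names and labels below are hypothetical stand-ins, not data from the actual study:

```python
import random

# Hypothetical labeled examples: (author, sentence_id, political_label).
examples = [(author, i, label)
            for author, label in [("Richardson", "tory"), ("Swift", "tory"),
                                  ("Defoe", "whig"), ("Fielding", "whig")]
            for i in range(100)]

random.seed(0)

# Data Split A: shuffle individual sentences, so every author appears in
# train, validation, and test. The model can memorize author-specific
# quirks during training and exploit them on that same author's test
# sentences.
shuffled = examples[:]
random.shuffle(shuffled)
split_a_train = shuffled[:240]
split_a_test = shuffled[320:]

# Data Split B: split by author, so each author's sentences all land in
# exactly one set and the test set contains only unseen authors.
authors = ["Richardson", "Swift", "Defoe", "Fielding"]
random.shuffle(authors)
train_authors = set(authors[:3])
split_b_train = [e for e in examples if e[0] in train_authors]
split_b_test = [e for e in examples if e[0] not in train_authors]

# Authors shared between train and test reveal the leakage.
overlap_a = {e[0] for e in split_a_test} & {e[0] for e in split_a_train}
overlap_b = {e[0] for e in split_b_test} & {e[0] for e in split_b_train}
print(sorted(overlap_a))  # authors leaked across Split A train and test
print(overlap_b)          # set() — no author crosses sets in Split B
```

Split A's inflated accuracy measures how well the model recognizes authors it has already seen, not how well "mind metaphors" predict political affiliation for new authors.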

The moral: carefully consider how you split examples.

Know what the data represents.

* We based this module very loosely (making some modifications along the way) on "Meaning and Mining: the Impact of Implicit Assumptions in Data Mining for the Humanities" by Sculley and Pasanek.