This page describes a few example use cases for the Prediction API, and how to design training data for each.
Imagine that you run a site that sells beer, wine, and cheese, and you want to predict whether a visitor will be interested in wine, given their purchase history. In this situation, you might create training data with three features:
- The number of times the customer has bought fancy dessert cheese.
- A value of 1 if the customer has ever bought wine, and 0 if not.
- The number of times the customer has bought beer.
A sample of training data for this problem might be encoded as follows:
The first instance (row) would then encode an example where a customer who likes wine has done the following:
- Bought fancy dessert cheese 5 times.
- Bought wine at least once in the past (value of 1).
- Bought beer once.
Note that some of the features are ambiguous: not everyone who buys fancy dessert cheese buys wine. Other features are negatively correlated with buying wine (for example, perhaps beer drinkers are somewhat less likely to buy wine). Both types of features are legitimate and useful for the prediction system.
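The encoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the "wine" and "no_wine" labels, the second customer, and the choice to put the label in the first column are all hypothetical and are not taken from the original sample data.

```python
import csv
import io

# Hypothetical purchase histories. The field names and the "wine"/"no_wine"
# labels are illustrative assumptions, not values from the original sample.
customers = [
    {"label": "wine", "cheese": 5, "bought_wine": True, "beer": 1},
    {"label": "no_wine", "cheese": 0, "bought_wine": False, "beer": 7},
]


def encode_row(customer):
    """Return [label, cheese count, wine flag (1 or 0), beer count]."""
    return [
        customer["label"],
        customer["cheese"],
        1 if customer["bought_wine"] else 0,
        customer["beer"],
    ]


def to_csv(rows):
    """Serialize rows as CSV, quoting strings and leaving numbers bare."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


training_csv = to_csv(encode_row(c) for c in customers)
print(training_csv)
```

The first row produced here corresponds to the customer described above: five cheese purchases, a wine flag of 1, and one beer purchase.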
Spam Comment Detection
The blog moderation example is sample code that tries to detect whether user-submitted comments on a web page are genuine comments or just spam. This task is similar to spam email detection, but you have far fewer signals to go on: typically, the only available data is the comment itself. The end-to-end spam comment detection sample uses only the text of the comment to try to detect spam. The sample is written for an imaginary physics website.
To get a set of spam comments, the sample pulls a list of spam comments hosted on the website ilps.science.uva.nl. To generate a list of non-spam comments, it parses several physics textbooks into sentences. The final training file is a two-column table like this:
"spam" | "ham","comment string"
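As a rough sketch of how such a two-column file could be assembled, the following Python combines a list of spam comments with a list of non-spam sentences. The function names and the example strings are hypothetical, and the whitespace normalization is an assumption rather than something the sample is known to do; fetching the spam list and parsing the textbooks are left out.

```python
import csv
import io


def clean(text):
    """Collapse runs of whitespace so each comment fits on one CSV line."""
    return " ".join(text.split())


def build_training_csv(spam_comments, ham_sentences):
    """Return CSV text with one "label","comment string" row per example."""
    buf = io.StringIO()
    # QUOTE_ALL quotes both columns; embedded quotes are doubled by the writer.
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for text in spam_comments:
        writer.writerow(["spam", clean(text)])
    for text in ham_sentences:
        writer.writerow(["ham", clean(text)])
    return buf.getvalue()


# Made-up examples, for illustration only.
example_csv = build_training_csv(
    ['Buy "cheap" pills\nnow'],
    ["Force equals mass times acceleration."],
)
print(example_csv)
```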
To improve accuracy, you might add features such as the number of links in each comment or, if commenting requires users to log in, the user name.
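A minimal sketch of deriving these extra features might look like the following. The link pattern is a deliberately simple assumption (it only matches http and https URLs), and the function names and column order are hypothetical.

```python
import re

# Assumption: a "link" is anything starting with http:// or https://.
LINK_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


def count_links(comment):
    """Return the number of http(s) URLs found in the comment text."""
    return len(LINK_PATTERN.findall(comment))


def featurize(label, comment, user_name=None):
    """Build a training row: label, comment text, link count, optional user name."""
    row = [label, comment, count_links(comment)]
    if user_name is not None:
        row.append(user_name)
    return row


print(featurize("spam", "Visit http://a.example and https://b.example", "bob"))
```

If user names are included, every row should include them so that all rows have the same number of columns.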
You could instead use a regression model and assign each example a numeric value that indicates its spamminess, rather than using the categories "spam" and "ham". However, you would then have to assign a spamminess value to a tremendous number of examples and ensure that these ratings were consistent, which would be both very difficult and very time consuming.