Suppose you want to develop a supervised machine learning model to predict
whether a given email is "spam" or "not spam." Which of the
following statements are true?
Emails not marked as "spam" or "not spam" are unlabeled examples.
Because our label consists of the values "spam" and "not spam",
any email not yet marked as spam or not spam is an
Words in the subject header will make good labels.
Words in the subject header might make excellent features, but they
won't make good labels.
We'll use unlabeled examples to train the model.
We'll use labeled examples to train the model. We can then
run the trained model against unlabeled examples to infer
whether the unlabeled email messages are spam or not spam.
The labels applied to some examples might be unreliable.
Definitely. It's important to check how reliable your data
is. The labels for this dataset probably come from email
users who mark particular email messages as spam. Since
most users do not mark every suspicious email message as spam, we may
have trouble knowing whether an email is spam. Furthermore,
spammers could intentionally poison our model by providing faulty
Features and Labels
Explore the options below.
Suppose an online shoe store wants to create a supervised ML model
that will provide personalized shoe recommendations to users. That is,
the model will recommend certain pairs of shoes to Marty and
different pairs of shoes to Janet. The system will use past user
behavior data to generate training data. Which of the following
statements are true?
"Shoe size" is a useful feature.
"Shoe size" is a quantifiable signal that likely has
a strong impact on whether the user will like the recommended
shoes. For example, if Marty wears size 9, the model shouldn't
recommend size 7 shoes.
"Shoe beauty" is a useful feature.
Good features are concrete and quantifiable.
Beauty is too vague a concept to serve as a useful feature.
Beauty is probably a blend of certain concrete features,
such as style and color. Style and color would each be
better features than beauty.
"The user clicked on the shoe's description" is a useful label.
Users probably only want to read more about those shoes that
they like. Clicks by users is, therefore, an observable, quantifiable
metric that could serve as a good training label. Since our training
data derives from past user behavior, our labels need to derive from
objective behaviors like clicks that strongly correlate with user
"Shoes that a user adores" is a useful label.
Adoration is not an observable, quantifiable metric. The best we can
do is search for observable proxy metrics for adoration.