Suppose you want to develop a supervised machine learning model to predict
whether a given email is "spam" or "not spam." Which of the
following statements are true?
Emails not marked as "spam" or "not spam" are unlabeled examples.
Because our label consists of the values "spam" and "not spam",
any email not yet marked as spam or not spam is an unlabeled example.
Words in the subject header will make good labels.
Words in the subject header might make excellent features, but they
won't make good labels.
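To make the feature/label distinction concrete, here is a minimal sketch in Python. The `Email` class, its field names, and the `subject_features` helper are hypothetical names introduced only for illustration; a real pipeline would likely use richer features than raw word counts.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Email:
    """Hypothetical container for one example."""
    subject: str
    # "spam", "not spam", or None when the email is still unlabeled.
    label: Optional[str] = None


def subject_features(email: Email) -> Dict[str, int]:
    """Turn subject-line words into bag-of-words features (word -> count)."""
    counts: Dict[str, int] = {}
    for word in email.subject.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


example = Email(subject="Win a FREE prize now", label="spam")
features = subject_features(example)  # input the model sees
label = example.label                 # value the model learns to predict
```

The subject words end up on the input side (`features`), while the label stays a separate field with only two possible values.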
We'll use unlabeled examples to train the model.
We'll use labeled examples to train the model. We can then
run the trained model against unlabeled examples to infer
whether the unlabeled email messages are spam or not spam.
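The train-then-infer workflow described above can be sketched as follows. This is a toy illustration, assuming a small hand-written naive Bayes classifier with add-one smoothing over whitespace-separated words; the `train` and `predict` functions and the sample emails are invented for this example and are not taken from any particular library or dataset.

```python
import math
from collections import Counter

LABELS = ("spam", "not spam")


def train(labeled_examples):
    """Count words per class from (text, label) pairs."""
    word_counts = {label: Counter() for label in LABELS}
    class_counts = Counter()
    for text, label in labeled_examples:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, class_counts


def predict(model, text):
    """Return the label with the highest smoothed log-probability."""
    word_counts, class_counts = model
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_examples = sum(class_counts.values())
    scores = {}
    for label in LABELS:
        n_words = sum(word_counts[label].values())
        # Log prior; assumes every label appears at least once in training.
        score = math.log(class_counts[label] / total_examples)
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log(
                (word_counts[label][word] + 1) / (n_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)


# Labeled examples: used for training.
labeled = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("lunch on monday", "not spam"),
]
# Unlabeled examples: the trained model infers their labels.
unlabeled = ["claim your free prize", "monday meeting notes"]

model = train(labeled)
predictions = [predict(model, text) for text in unlabeled]
```

Only the labeled emails reach `train`; the unlabeled ones are passed to `predict` after training is complete.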
The labels applied to some examples might be unreliable.
Definitely. It's important to check how reliable your data
is. The labels for this dataset probably come from email
users who mark particular email messages as spam. Since
most users do not mark every suspicious email message as spam, we may
have trouble knowing whether an email is spam. Furthermore,
spammers could intentionally poison our model by providing faulty