Fairness: Check Your Understanding

Types of Bias

Explore the options below.

Which of the following models' predictions have been affected by selection bias?
A German handwriting recognition smartphone app uses a model that frequently incorrectly classifies ß (Eszett) characters as B characters, because it was trained on a corpus of American handwriting samples, mostly written in English.
This model was affected by a type of selection bias called coverage bias: the training data (American English handwriting) was not representative of the type of data provided by the model's target audience (German handwriting).
Engineers built a model to predict the likelihood of a person developing diabetes based on their daily food intake. The model was trained on 10,000 "food diaries" collected from a randomly chosen group of people worldwide representing a variety of different age groups, ethnic backgrounds, and genders. However, when the model was deployed, it had very poor accuracy. Engineers subsequently discovered that food diary participants were reluctant to admit the true volume of unhealthy foods they ate, and were more likely to document consumption of nutritious food than less healthy snacks.
There is no selection bias in this model; the participants who provided training data were a representative sample of users, chosen at random. Instead, this model was affected by reporting bias: consumption of unhealthy foods was reported at a much lower frequency than its true real-world occurrence.
Engineers at a company developed a model to predict staff turnover rates (the percentage of employees quitting their jobs each year) based on data collected from a survey sent to all employees. After several years of use, engineers determined that the model underestimated turnover by more than 20%. When conducting exit interviews with employees leaving the company, they learned that more than 80% of people who were dissatisfied with their jobs chose not to complete the survey, compared to a company-wide opt-out rate of 15%.
This model was affected by a type of selection bias called non-response bias. People who were dissatisfied with their jobs were underrepresented in the training data set because they opted out of the company-wide survey at much higher rates than the entire employee population.
Engineers developing a movie-recommendation system hypothesized that people who like horror movies will also like science-fiction movies. When they trained a model on 50,000 users' watchlists, however, it showed no such correlation between preferences for horror and for sci-fi; instead it showed a strong correlation between preferences for horror and for documentaries. This seemed odd to them, so they retrained the model five more times using different hyperparameters. Their final trained model showed a 70% correlation between preferences for horror and for sci-fi, so they confidently released it into production.
There is no evidence of selection bias, but this model may have instead been affected by experimenter's bias, as the engineers kept iterating on their model until it confirmed their preexisting hypothesis.

Evaluating for Bias

A sarcasm-detection model was trained on 80,000 text messages: 40,000 messages sent by adults (18 years and older) and 40,000 messages sent by minors (less than 18 years old). The model was then evaluated on a test set of 20,000 messages: 10,000 from adults and 10,000 from minors. The following confusion matrices show the results for each group (a positive prediction signifies a classification of "sarcastic"; a negative prediction signifies a classification of "not sarcastic"):

Adults

True Positives (TPs): 512 False Positives (FPs): 51
False Negatives (FNs): 36 True Negatives (TNs): 9401
$$\text{Precision} = \frac{TP}{TP+FP} = 0.909$$
$$\text{Recall} = \frac{TP}{TP+FN} = 0.934$$

Minors

True Positives (TPs): 2147 False Positives (FPs): 96
False Negatives (FNs): 2177 True Negatives (TNs): 5580
$$\text{Precision} = \frac{TP}{TP+FP} = 0.957$$
$$\text{Recall} = \frac{TP}{TP+FN} = 0.497$$
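
As a check on the arithmetic above, here is a minimal Python sketch (the variable names and dictionary layout are ours, not part of the exercise) that recomputes each group's precision and recall from the confusion-matrix counts:

```python
# Recompute precision and recall for each group from the confusion-matrix
# counts listed above.
groups = {
    "adults": {"tp": 512, "fp": 51, "fn": 36, "tn": 9401},
    "minors": {"tp": 2147, "fp": 96, "fn": 2177, "tn": 5580},
}

for name, c in groups.items():
    precision = c["tp"] / (c["tp"] + c["fp"])
    recall = c["tp"] / (c["tp"] + c["fn"])
    print(f"{name}: precision={precision:.3f}, recall={recall:.3f}")

# Expected output (rounded to three decimal places):
#   adults: precision=0.909, recall=0.934
#   minors: precision=0.957, recall=0.497
```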

Explore the options below.

Which of the following statements about the model's test-set performance are true?
Overall, the model performs better on examples from adults than on examples from minors.

The model achieves both precision and recall rates over 90% when detecting sarcasm in text messages from adults.

While the model achieves a slightly higher precision rate for minors than adults, the recall rate is substantially lower for minors, resulting in less reliable predictions for this group.

The model fails to classify approximately 50% of minors' sarcastic messages as "sarcastic."
The recall rate of 0.497 for minors indicates that the model predicts "not sarcastic" for approximately 50% of minors' sarcastic texts.
Approximately 50% of messages sent by minors are incorrectly classified as "sarcastic."
The precision rate of 0.957 indicates that over 95% of minors' messages classified as "sarcastic" are actually sarcastic.
The 10,000 messages sent by adults are a class-imbalanced dataset.
If we compare the number of messages from adults that are actually sarcastic (TP+FN = 548) with the number of messages that are actually not sarcastic (TN + FP = 9452), we see that "not sarcastic" labels outnumber "sarcastic" labels by a ratio of approximately 17:1.
The 10,000 messages sent by minors are a class-imbalanced dataset.
If we compare the number of messages from minors that are actually sarcastic (TP+FN = 4324) with the number of messages that are actually not sarcastic (TN + FP = 5676), we see that there is a 1.3:1 ratio of "not sarcastic" labels to "sarcastic" labels. Given that the distribution of labels between the two classes is quite close to 50/50, this is not a class-imbalanced dataset.
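
The class-balance claims above can be verified from the same counts; this short sketch (again using our own names) prints the ratio of actual "not sarcastic" to actual "sarcastic" labels for each group:

```python
# Compare actual label counts per group using the confusion-matrix values:
# actually sarcastic = TP + FN, actually not sarcastic = TN + FP.
groups = {
    "adults": {"tp": 512, "fp": 51, "fn": 36, "tn": 9401},
    "minors": {"tp": 2147, "fp": 96, "fn": 2177, "tn": 5580},
}

for name, c in groups.items():
    sarcastic = c["tp"] + c["fn"]
    not_sarcastic = c["tn"] + c["fp"]
    ratio = not_sarcastic / sarcastic
    print(f"{name}: {sarcastic} sarcastic vs. {not_sarcastic} not sarcastic "
          f"(about {ratio:.1f}:1)")

# Expected output:
#   adults: 548 sarcastic vs. 9452 not sarcastic (about 17.2:1)
#   minors: 4324 sarcastic vs. 5676 not sarcastic (about 1.3:1)
```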

Explore the options below.

Engineers are working on retraining this model to address inconsistencies in sarcasm-detection accuracy across age demographics, but the model has already been released into production. Which of the following stopgap strategies will help mitigate errors in the model's predictions?
Restrict the model's usage to text messages sent by adults.

The model performs well on text messages from adults (with precision and recall rates both above 90%), so restricting its use to this group will sidestep the systematic errors in classifying minors' text messages.

When the model predicts "not sarcastic" for text messages sent by minors, adjust the output so the model returns a value of "unsure" instead.

The precision rate for text messages sent by minors is high, which means that when the model predicts "sarcastic" for this group, it is nearly always correct.

The problem is that recall is very low for minors; the model fails to identify sarcasm in approximately 50% of examples. Because the model labels roughly half of minors' genuinely sarcastic messages as "not sarcastic," a negative prediction for this group is too unreliable to surface as-is; returning "unsure" instead avoids presenting these errors as confident predictions.
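
One possible shape for this stopgap, sketched in Python: `classify_sarcasm` and the age field are hypothetical stand-ins for whatever interface the production model actually exposes.

```python
# Hypothetical post-processing wrapper around the deployed classifier.
# classify_sarcasm is assumed to return either "sarcastic" or "not sarcastic".
def classify_with_stopgap(message: str, sender_age: int, classify_sarcasm) -> str:
    prediction = classify_sarcasm(message)
    if sender_age < 18 and prediction == "not sarcastic":
        # Recall for minors is only ~0.5, so negative predictions for this
        # group are unreliable; defer rather than return a likely error.
        return "unsure"
    return prediction
```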

Restrict the model's usage to text messages sent by minors.

The systematic errors in this model are specific to text messages sent by minors. Restricting the model's use to the group more susceptible to error would not help.

Adjust the model output so that it returns "sarcastic" for all text messages sent by minors, regardless of what the model originally predicted.

Always predicting "sarcastic" for minors' text messages would increase the recall rate from 0.497 to 1.0, as the model would no longer miss any actually sarcastic messages. However, this increase in recall would come at the expense of precision: all the false negatives would become true positives, and all the true negatives would become false positives:

True Positives (TPs): 4324 False Positives (FPs): 5676
False Negatives (FNs): 0 True Negatives (TNs): 0

which would decrease the precision rate from 0.957 to 0.432. So, adding this calibration would change the type of error but would not mitigate the magnitude of the error.
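
The 0.432 figure can be reproduced directly from the adjusted confusion matrix:

```python
# Precision for minors if every message were labeled "sarcastic":
# all 4324 actually-sarcastic messages become TPs, and all 5676
# actually-not-sarcastic messages become FPs.
tp, fp = 4324, 5676
print(f"precision = {tp / (tp + fp):.3f}")  # precision = 0.432
```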