ML Practicum: Fairness in Perspective API

Check Your Understanding: Identifying and Remediating Bias

Identifying Bias

In Exercise #1: Explore the Model, you confirmed that the model was disproportionately classifying comments with identity terms as toxic. Which metrics help explain the cause of this bias? Explore the options below.

Accuracy

Accuracy measures the percentage of total predictions that are correct—the percentage of predictions that are true positives or true negatives. Comparing accuracy for different subgroups (such as different gender demographics) lets us evaluate the model's relative performance for each group and can serve as an indicator of the effect of bias on a model.

However, because accuracy considers correct and incorrect predictions in aggregate, it doesn't distinguish between the two types of correct predictions and the two types of incorrect predictions. Looking at accuracy alone, we can't determine the underlying breakdowns of true positives, true negatives, false positives, and false negatives, which would provide more insight into the source of the bias.
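To make this concrete, here is a minimal sketch (the subgroup names and confusion-matrix counts are hypothetical, not from the actual Perspective API dataset) showing how two subgroups can have identical accuracy while hiding very different error breakdowns:

```python
# Hypothetical confusion-matrix counts per subgroup; numbers are illustrative.
subgroups = {
    "gender_terms":    {"tp": 45, "tn": 40, "fp": 10, "fn": 5},
    "no_gender_terms": {"tp": 35, "tn": 50, "fp": 5,  "fn": 10},
}

def accuracy(c):
    """Fraction of all predictions that are correct: (TP + TN) / total."""
    total = c["tp"] + c["tn"] + c["fp"] + c["fn"]
    return (c["tp"] + c["tn"]) / total

for name, counts in subgroups.items():
    print(name, accuracy(counts))
# Both subgroups score 0.85, even though the first has twice as many
# false positives -- accuracy alone can't surface that difference.
```

Here both groups reach 0.85 accuracy, yet the identity-term subgroup accumulates its errors as false positives, which is exactly the breakdown accuracy obscures.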

False positive rate

False positive rate (FPR) is the percentage of actual-negative examples (nontoxic comments) that were incorrectly classified as positives (toxic comments). FPR is an indicator of the effect of bias on the model. When we compare the FPRs for different subgroups (such as different gender demographics), we learn that text comments that contain identity terms related to gender are more likely to be incorrectly classified as toxic (false positives) than comments that don't contain these terms.
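As a sketch of the comparison (the counts below are hypothetical, chosen only to illustrate the formula FPR = FP / (FP + TN)):

```python
def false_positive_rate(fp, tn):
    """Fraction of actual negatives misclassified as positive: FP / (FP + TN)."""
    return fp / (fp + tn)

# Hypothetical counts for comments with and without identity terms.
fpr_with_terms = false_positive_rate(fp=10, tn=40)     # 10/50 = 0.2
fpr_without_terms = false_positive_rate(fp=5, tn=50)   # 5/55  ≈ 0.09
print(fpr_with_terms, fpr_without_terms)
```

A notably higher FPR for the identity-term subgroup is the signature of the bias observed in the exercise.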

However, we're not looking to measure the effect of the bias; we want to find its cause. To do so, we need to take a closer look at the inputs to the FPR formula.

Actual negatives and actual positives

In this model's training and test datasets, actual positives are all the examples of comments that are toxic, and actual negatives are all the examples that are nontoxic. Given that identity terms themselves are neutral, we'd expect a balanced number of actual-negative and actual-positive comments containing a given identity term. If we see a disproportionately low number of actual negatives, that tells us the model didn't see very many examples of identity terms used in positive or neutral contexts. In that case, the model might learn a correlation between identity terms and toxicity.
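This kind of balance check can be run directly over the labeled data. The sketch below uses hypothetical comments and a hypothetical identity-term list; a real audit would use the actual dataset and a vetted term list:

```python
# Hypothetical labeled comments: (text, is_toxic). Real data would be far larger.
examples = [
    ("gay people deserve respect", False),
    ("I hate gay people", True),
    ("lesbians are awful", True),
    ("what a lovely day", False),
]
IDENTITY_TERMS = {"gay", "lesbian", "lesbians", "transgender"}

def contains_identity_term(text):
    """Naive whitespace tokenization; production code would normalize better."""
    return any(tok in IDENTITY_TERMS for tok in text.lower().split())

labels = [is_toxic for text, is_toxic in examples if contains_identity_term(text)]
toxic = sum(labels)
nontoxic = len(labels) - toxic
print(f"identity-term comments: {toxic} toxic, {nontoxic} nontoxic")
```

If the toxic count dominates, as in this toy slice, the model has few examples from which to learn neutral uses of the terms.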
Recall

Recall is the percentage of actual positives that were correctly classified as positives. It tells us the percentage of toxic comments the model successfully caught. Here, however, we're concerned with bias related to false positives (nontoxic comments that were classified as toxic), and recall doesn't provide any insight into that problem.

Remediating Bias

Which of the following actions might be effective methods of remediating bias in the training data used in Exercise #1 and Exercise #2? Explore the options below.
Add more negative (nontoxic) examples containing identity terms to the training set.
Adding more negative examples (comments that are actually nontoxic) that contain identity terms will help balance the training set. The model will then see a better balance of identity terms used in toxic and nontoxic contexts, so that it can learn that the terms themselves are neutral.
Add more positive (toxic) examples containing identity terms to the training set.
Toxic examples are already overrepresented in the subset of examples containing identity terms. If we add even more of these examples to the training set, we'll actually be exacerbating the existing bias rather than remediating it.
Add more negative (nontoxic) examples without identity terms to the training set.
Identity terms are already underrepresented in negative examples. Adding more negative examples without identity terms would increase this imbalance and would not help remediate the bias.
Add more positive (toxic) examples without identity terms to the training set.
It's possible that adding more positive examples without identity terms might help break the association between identity terms and toxicity that the model had previously learned.
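The first remediation above amounts to rebalancing the training set. As a rough sketch, the helper below (hypothetical, not part of any exercise code) oversamples an underrepresented slice by duplication; in practice you'd prefer collecting new data or using weighted sampling:

```python
import random

def rebalance(examples, is_underrepresented, target_count, seed=0):
    """Duplicate examples from an underrepresented slice until the slice
    reaches target_count. Simple oversampling, for illustration only."""
    rng = random.Random(seed)
    slice_ = [ex for ex in examples if is_underrepresented(ex)]
    extra = [rng.choice(slice_) for _ in range(target_count - len(slice_))]
    return examples + extra

# Hypothetical (text, is_toxic) examples.
data = [("gay people are great", False), ("I hate everyone", True),
        ("terrible gay agenda", True), ("nice weather", False)]

# Oversample the nontoxic identity-term slice (here: just one example).
balanced = rebalance(
    data,
    is_underrepresented=lambda ex: "gay" in ex[0] and not ex[1],
    target_count=2)
print(len(balanced))
```

After rebalancing, the identity-term slice contains equal numbers of toxic and nontoxic examples, which is the balance the answer above calls for.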

Evaluating for Bias

You've trained your own text-toxicity classifier from scratch, which your engineering team plans to use to automatically suppress display of comments classified as toxic. You're concerned that any bias toward toxicity for gender-related comments might result in suppression of nontoxic discourse about gender, and want to assess gender-related bias in the classifier's predictions. Which of the following metrics should you use to evaluate the model? Explore the options below.
False positive rate (FPR)
In production, the model will be used to automatically suppress positive (toxic) predictions. Your goal is to ensure the model is not suppressing false positives (nontoxic comments that the model misclassified as toxic) for gender-related comments at a higher rate than for comments overall. Comparing FPRs for gender subgroups to overall FPR is a great way to evaluate bias remediation for your use case.
False negative rate (FNR)
FNR measures the rate at which the model misclassifies the positive class (here, "toxic") as the negative class ("nontoxic"). For this use case, it tells you the rate at which actually toxic comments will slip through the filter and be displayed to users. Here, your primary concern is how bias manifests in terms of suppression of nontoxic discourse. FNR doesn't give you any insight into this dimension of the model's performance.
Accuracy

Accuracy measures the percentage of model predictions that were correct and, conversely, the percentage of predictions that were wrong. For this use case, accuracy tells you how likely it is that the filter suppressed nontoxic discourse or displayed toxic discourse. Your primary concern is the former issue, not the latter. Because accuracy conflates the two issues, it's not the ideal evaluation metric to use here.
AUC

AUC provides an aggregate measure of a model's predictive ability across all classification thresholds. It's a good metric for assessing overall performance. However, here you're specifically concerned with comment-suppression rates, and AUC doesn't give you direct insight into this issue.
A content moderator has been added to your team, and the product manager has decided to change how your classifier will be deployed. Instead of automatically suppressing the comments classified as toxic, the filtering software will flag these comments for the content moderator to review. Since a human will be reviewing comments labeled as toxic, bias will no longer manifest in the form of content suppression. Which of the following metrics might you want to use to measure bias—and the effect of bias remediation—now? Explore the options below.
False positive rate (FPR)
False positive rate will tell you the percentage of nontoxic comments that were misclassified as toxic. Since a human moderator will now be auditing all comments the model labels "toxic," and should catch most false positives, FPR is no longer a primary concern.
False negative rate (FNR)
While a human moderator will be auditing all comments labeled "toxic" and ensuring that false positives are not suppressed, they will not be reviewing comments labeled "nontoxic." This leaves open the possibility of bias related to false negatives. You can use FNR (the percentage of actual positives that were classified as negatives) to systematically evaluate whether toxic comments for gender subgroups are more likely to be labeled nontoxic than comments overall.
Precision

Precision tells you the percentage of positive predictions that are actually positive—in this case, the percentage of "toxic" predictions that are correct. Since a human moderator will be auditing all the "toxic" predictions, you don't need to make precision one of your primary evaluation metrics.
Recall

Recall tells you the percentage of actual positives that were classified correctly. From this value, you can derive the percentage of actual positives that were misclassified (1 – recall, which equals the FNR), a useful metric for gauging whether gender-related toxic comments are disproportionately misclassified as "nontoxic" compared to comments overall.
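To make the FNR comparison concrete, here is a minimal sketch using hypothetical counts (FNR = FN / (FN + TP) = 1 − recall); real evaluation would slice the actual test set by subgroup:

```python
def false_negative_rate(fn, tp):
    """Fraction of actual positives misclassified as negative: 1 - recall."""
    return fn / (fn + tp)

# Hypothetical counts for the gender subgroup vs. comments overall.
fnr_gender = false_negative_rate(fn=12, tp=38)   # 12/50 = 0.24
fnr_overall = false_negative_rate(fn=8, tp=72)   # 8/80  = 0.10
print(fnr_gender > fnr_overall)
```

A subgroup FNR well above the overall FNR would indicate that toxic gender-related comments are disproportionately slipping past the moderator queue.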