When evaluating a model, metrics calculated against an entire test or validation set don't always give an accurate picture of how fair the model is.
Consider a new model developed to predict the presence of tumors that is evaluated against a validation set of 1,000 patients' medical records. 500 records are from female patients, and 500 records are from male patients. The following confusion matrix summarizes the results for all 1,000 examples:
| True Positives (TPs): 16 | False Positives (FPs): 4 |
| False Negatives (FNs): 6 | True Negatives (TNs): 974 |

$$\text{Precision} = \frac{TP}{TP+FP} = \frac{16}{16+4} = 0.800$$

$$\text{Recall} = \frac{TP}{TP+FN} = \frac{16}{16+6} = 0.727$$
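The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch: the helper names `precision` and `recall` are illustrative and do not come from any particular library, and the counts are copied from the aggregate confusion matrix.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of positive predictions that are correct: TP / (TP + FP)."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that the model finds: TP / (TP + FN)."""
    return tp / (tp + fn)


# Counts from the aggregate confusion matrix for all 1,000 patients.
tp, fp, fn, tn = 16, 4, 6, 974

overall_precision = precision(tp, fp)
overall_recall = recall(tp, fn)

print(f"precision = {overall_precision:.3f}")  # precision = 0.800
print(f"recall    = {overall_recall:.3f}")     # recall    = 0.727
```

Note that the 974 true negatives do not enter either formula, which is why both metrics can look healthy even when very few patients actually have tumors.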
These results look promising: precision of 80% and recall of 72.7%. But what happens if we calculate the result separately for each set of patients? Let's break out the results into two separate confusion matrices: one for female patients and one for male patients.
Female Patient Results
| True Positives (TPs): 10 | False Positives (FPs): 1 |
| False Negatives (FNs): 1 | True Negatives (TNs): 488 |

$$\text{Precision} = \frac{TP}{TP+FP} = \frac{10}{10+1} = 0.909$$

$$\text{Recall} = \frac{TP}{TP+FN} = \frac{10}{10+1} = 0.909$$
Male Patient Results
| True Positives (TPs): 6 | False Positives (FPs): 3 |
| False Negatives (FNs): 5 | True Negatives (TNs): 486 |

$$\text{Precision} = \frac{TP}{TP+FP} = \frac{6}{6+3} = 0.667$$

$$\text{Recall} = \frac{TP}{TP+FN} = \frac{6}{6+5} = 0.545$$
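The same per-group breakdown can be sketched in Python. The dictionary layout and the `precision_recall` helper below are assumptions made for illustration; only the counts come from the two confusion matrices. The sketch also checks that the subgroup counts add up to the aggregate matrix, a useful sanity check whenever a validation set is sliced.

```python
from collections import Counter

# Confusion-matrix counts per subgroup, copied from the two matrices above.
groups = {
    "female": {"tp": 10, "fp": 1, "fn": 1, "tn": 488},
    "male": {"tp": 6, "fp": 3, "fn": 5, "tn": 486},
}


def precision_recall(counts: dict) -> tuple:
    """Return (precision, recall) for one confusion matrix."""
    p = counts["tp"] / (counts["tp"] + counts["fp"])
    r = counts["tp"] / (counts["tp"] + counts["fn"])
    return p, r


per_group = {name: precision_recall(c) for name, c in groups.items()}

for name, (p, r) in per_group.items():
    print(f"{name}: precision = {p:.3f}, recall = {r:.3f}")
# female: precision = 0.909, recall = 0.909
# male: precision = 0.667, recall = 0.545

# Sanity check: summing the subgroup counts should recover the aggregate matrix.
total = Counter()
for c in groups.values():
    total.update(c)
print(dict(total))  # {'tp': 16, 'fp': 4, 'fn': 6, 'tn': 974}
```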
When we calculate metrics separately for female and male patients, we see stark differences in model performance for each group.
Female patients:
Of the 11 female patients who actually have tumors, the model correctly predicts positive for 10 (recall: 90.9%). In other words, the model misses the tumor in 9.1% of female patients who have one.
Similarly, when the model predicts a tumor for a female patient, it is correct in 10 out of 11 cases (precision: 90.9%); in other words, 9.1% of the model's positive predictions for female patients are wrong.
Male patients:
However, of the 11 male patients who actually have tumors, the model correctly predicts positive for only 6 (recall: 54.5%). That means the model misses the tumor in 45.5% of male patients who have one.
And when the model predicts a tumor for a male patient, it is correct in only 6 out of 9 cases (precision: 66.7%); in other words, 33.3% of the model's positive predictions for male patients are wrong.
We now have a much better understanding of the biases inherent in the model's predictions, as well as the risks to each subgroup if the model were to be released for medical use in the general population.
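To put numbers on those risks, the sketch below computes the two error rates quoted in the bullet points for each group, along with the gap between groups. The names "miss rate" (false negative rate, equal to 1 minus recall) and "false discovery rate" (equal to 1 minus precision) are standard terms but are not used in the text above, and the variable names are illustrative assumptions.

```python
# Counts per subgroup, copied from the female and male confusion matrices.
groups = {
    "female": {"tp": 10, "fp": 1, "fn": 1},
    "male": {"tp": 6, "fp": 3, "fn": 5},
}

rates = {}
for name, c in groups.items():
    # Miss rate: share of patients with tumors that the model fails to flag.
    miss_rate = c["fn"] / (c["tp"] + c["fn"])
    # False discovery rate: share of positive predictions that are wrong.
    false_discovery_rate = c["fp"] / (c["tp"] + c["fp"])
    rates[name] = (miss_rate, false_discovery_rate)
    print(f"{name}: miss rate = {miss_rate:.1%}, "
          f"false discovery rate = {false_discovery_rate:.1%}")
# female: miss rate = 9.1%, false discovery rate = 9.1%
# male: miss rate = 45.5%, false discovery rate = 33.3%

# Gap between groups, in percentage points.
miss_gap = (rates["male"][0] - rates["female"][0]) * 100
fdr_gap = (rates["male"][1] - rates["female"][1]) * 100
print(f"miss-rate gap = {miss_gap:.1f} points")            # 36.4 points
print(f"false-discovery-rate gap = {fdr_gap:.1f} points")  # 24.2 points
```

With only 11 actual positives per group, these rates rest on very small counts, so the gaps are best read as a signal to investigate further rather than as precise estimates.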
Additional Fairness Resources
Fairness is a relatively new subfield within the discipline of machine learning. To learn more about research and initiatives devoted to developing new tools and techniques for identifying and mitigating bias in machine learning models, check out Google's Machine Learning Fairness resources page.