As you explore your data to determine how best to represent it in your model, it's important to also keep issues of fairness in mind and proactively audit for potential sources of bias.
Where might bias lurk? Here are three red flags to look out for in your data set.
Missing Feature Values
If your data set has one or more features that have missing values for a large number of examples, that could be an indicator that certain key characteristics of your data set are under-represented.
For example, the table below shows a summary of key stats for a subset of features in the California Housing data set, stored in a pandas DataFrame and generated via DataFrame.describe. Note that all features have a count of 17000, indicating there are no missing values.
Suppose instead that three features only had a count of 3000; in other words, that there were 14,000 missing values for each of those features.
These 14,000 missing values would make it much more difficult to accurately correlate average income of households with median house prices. Before training a model on this data, it would be prudent to investigate the cause of these missing values to ensure that there are no latent biases responsible for missing income and population data.
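A minimal sketch of how such an audit might look in pandas follows. The toy DataFrame and its column names (median_income, population, median_house_value) are assumptions made up for illustration, not values from the actual data set. The sketch reads the per-feature count from DataFrame.describe, computes the fraction of missing values per feature, and then checks whether rows with missing income differ systematically on another feature, which is one hint that the gaps may not be random.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a housing DataFrame; all values are invented.
df = pd.DataFrame({
    "median_income": [3.2, np.nan, 5.1, np.nan, 2.8],
    "population": [1200.0, 850.0, np.nan, np.nan, 430.0],
    "median_house_value": [210000.0, 98000.0, 340000.0, 120000.0, 175000.0],
})

# describe() reports the number of non-null values per column in its "count" row.
# A count smaller than len(df) means that feature has missing values.
summary = df.describe()
print(summary.loc["count"])

# Fraction of missing values per feature, largest first.
missing_fraction = df.isna().mean().sort_values(ascending=False)
print(missing_fraction)

# Compare another feature across rows with and without a recorded income.
# A large difference suggests the missingness is related to that feature.
income_missing = df["median_income"].isna().rename("income_missing")
by_missing = df.groupby(income_missing)["median_house_value"].mean()
print(by_missing)
```

If the group means differ sharply, that is a cue to look into how the data was collected before deciding whether to drop or impute the affected rows.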
Unexpected Feature Values
When exploring data, you should also look for examples that contain feature values that stand out as especially uncharacteristic or unusual. These unexpected feature values could indicate problems that occurred during data collection or other inaccuracies that could introduce bias.
For example, take a look at the following excerpted examples from the California housing data set:
Can you pinpoint any unexpected feature values?
The longitude and latitude coordinates in example 4 (-103.5 and 43.8, respectively) do not fall within the U.S. state of California. In fact, they are the approximate coordinates of Mount Rushmore National Memorial in the state of South Dakota. This is a bogus example that we inserted into the data set.
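A check like this can be automated with a simple range test. The sketch below assumes a loose, approximate bounding box for California and a toy DataFrame in which only the coordinates of example 4 come from the text above; the other rows and the exact bounds are illustrative assumptions. A rectangular box is a coarse filter, so it flags obvious outliers but would not catch a point that is inside the box yet outside the state border.

```python
import pandas as pd

# Loose, approximate bounding box for California (an assumption for illustration).
LAT_MIN, LAT_MAX = 32.0, 42.5
LON_MIN, LON_MAX = -125.0, -114.0

# Toy rows indexed 1..4; only row 4's coordinates are taken from the text.
df = pd.DataFrame(
    {
        "longitude": [-122.2, -118.3, -121.5, -103.5],
        "latitude": [37.8, 34.1, 38.6, 43.8],
    },
    index=[1, 2, 3, 4],
)

# Series.between is inclusive on both ends by default.
in_bounds = (
    df["latitude"].between(LAT_MIN, LAT_MAX)
    & df["longitude"].between(LON_MIN, LON_MAX)
)

# Rows that fall outside the box deserve a manual look.
suspect = df[~in_bounds]
print(suspect)
```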
Data Skew
Any sort of skew in your data, where certain groups or characteristics may be under- or over-represented relative to their real-world prevalence, can introduce bias into your model.
If you completed the Validation programming exercise, you may recall discovering how a failure to randomize the California housing data set prior to splitting it into training and validation sets resulted in a pronounced data skew. Figure 1 visualizes a subset of data drawn from the full data set that exclusively represents the northwest region of California.
Figure 1. California state map overlaid with data from the California Housing data set. Each dot represents a housing block, with colors ranging from blue to red corresponding to median house price ranging from low to high, respectively.
If this unrepresentative sample were used to train a model to predict California housing prices statewide, the lack of housing data from southern portions of California would be problematic. The geographical bias encoded in the model might adversely affect homebuyers in unrepresented communities.
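The effect of skipping randomization can be sketched with synthetic data. The example below assumes, purely for illustration, a file whose rows are ordered by longitude; the column name, value range, and split sizes are all made up. It compares the geographic coverage of a training slice taken from the unshuffled rows with one taken after shuffling via DataFrame.sample.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: 1000 rows sorted west to east, mimicking a
# geographically ordered file (an assumption for this sketch).
df = pd.DataFrame({
    "longitude": np.sort(rng.uniform(-124.0, -114.5, size=1000)),
})

# Taking the first 800 rows without shuffling keeps only the westernmost data.
train_sorted = df.iloc[:800]

# Shuffling first gives a training slice drawn from the whole range.
train_shuffled = df.sample(frac=1.0, random_state=0).iloc[:800]

full_mean = df["longitude"].mean()
sorted_gap = abs(train_sorted["longitude"].mean() - full_mean)
shuffled_gap = abs(train_shuffled["longitude"].mean() - full_mean)

print("easternmost longitude, full data:     ", df["longitude"].max())
print("easternmost longitude, unshuffled set:", train_sorted["longitude"].max())
print("mean gap, unshuffled:", sorted_gap)
print("mean gap, shuffled:  ", shuffled_gap)
```

Comparing summary statistics of each split against the full data set is a quick sanity check; a large gap on a feature such as location signals that the split is not representative.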