Fairness: Identifying Bias

As you explore your data to determine how best to represent it in your model, it's important to also keep issues of fairness in mind and proactively audit for potential sources of bias.

Where might bias lurk? Here are three red flags to look out for in your data set.

Missing Feature Values

If your data set has one or more features that have missing values for a large number of examples, that could be an indicator that certain key characteristics of your data set are under-represented.

For example, the table below shows summary statistics for a subset of features in the California Housing data set, stored in a pandas DataFrame and generated via DataFrame.describe(). Note that every feature has a count of 17000, indicating there are no missing values:

       longitude  latitude  total_rooms  population  households  median_income  median_house_value
count    17000.0   17000.0      17000.0     17000.0     17000.0        17000.0             17000.0
mean      -119.6      35.6       2643.7      1429.6       501.2            3.9               207.3
std          2.0       2.1       2179.9      1147.9       384.5            1.9               116.0
min       -124.3      32.5          2.0         3.0         1.0            0.5                15.0
25%       -121.8      33.9       1462.0       790.0       282.0            2.6               119.4
50%       -118.5      34.2       2127.0      1167.0       409.0            3.5               180.4
75%       -118.0      37.7       3151.2      1721.0       605.2            4.8               265.0
max       -114.3      42.0      37937.0     35682.0      6082.0           15.0               500.0
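
For reference, a summary like the one above can be generated with a few lines of pandas. The sketch below is illustrative only: the file path is a placeholder, and the column names are assumed to match those shown in the table.

```python
import pandas as pd

# Load the California Housing data into a DataFrame.
# "california_housing.csv" is a placeholder path; point it at your own copy.
housing_df = pd.read_csv("california_housing.csv")

# describe() reports count, mean, std, min, quartiles, and max for every
# numeric column. A count lower than the number of rows signals missing values.
print(housing_df.describe())
```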

Suppose instead that three features (population, households, and median_income) only had a count of 3000—in other words, that there were 14,000 missing values for each feature:

       longitude  latitude  total_rooms  population  households  median_income  median_house_value
count    17000.0   17000.0      17000.0      3000.0      3000.0         3000.0             17000.0
mean      -119.6      35.6       2643.7      1429.6       501.2            3.9               207.3
std          2.0       2.1       2179.9      1147.9       384.5            1.9               116.0
min       -124.3      32.5          2.0         3.0         1.0            0.5                15.0
25%       -121.8      33.9       1462.0       790.0       282.0            2.6               119.4
50%       -118.5      34.2       2127.0      1167.0       409.0            3.5               180.4
75%       -118.0      37.7       3151.2      1721.0       605.2            4.8               265.0
max       -114.3      42.0      37937.0     35682.0      6082.0           15.0               500.0

These 14,000 missing values would make it much more difficult to accurately correlate median income of households with median house prices. Before training a model on this data, it would be prudent to investigate the cause of these missing values to ensure that there are no latent biases responsible for missing income and population data.
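
One way to start that investigation is to quantify the gaps directly. The following sketch assumes the housing_df DataFrame from the earlier example; the 10% threshold is arbitrary and only for illustration.

```python
# Count missing values per feature and express them as a fraction of all rows.
missing_counts = housing_df.isnull().sum()
missing_fraction = missing_counts / len(housing_df)

# Flag features where more than 10% of examples are missing a value, so they
# can be examined for systematic causes before training.
suspect_features = missing_fraction[missing_fraction > 0.10]
print(suspect_features.sort_values(ascending=False))
```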

Unexpected Feature Values

When exploring data, you should also look for examples that contain feature values that stand out as especially uncharacteristic or unusual. These unexpected feature values could indicate problems that occurred during data collection or other inaccuracies that could introduce bias.

For example, take a look at the following excerpted examples from the California housing data set:

   longitude  latitude  total_rooms  population  households  median_income  median_house_value
1     -121.7      38.0       7105.0      3523.0      1088.0            5.0                  0.2
2     -122.4      37.8       2479.0      1816.0       496.0            3.1                  0.3
3     -122.0      37.0       2813.0      1337.0       477.0            3.7                  0.3
4     -103.5      43.8       2212.0       803.0       144.0            5.3                  0.2
5     -117.1      32.8       2963.0      1162.0       556.0            3.6                  0.2
6     -118.0      33.7       3396.0      1542.0       472.0            7.4                  0.4

Can you pinpoint any unexpected feature values?
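
One way to surface candidates programmatically is a simple range check against known-valid bounds. The sketch below uses an approximate bounding box for California; the exact bounds are assumptions chosen for illustration, and it again relies on the housing_df DataFrame from earlier.

```python
# Approximate latitude/longitude bounds for California (illustrative values).
LON_MIN, LON_MAX = -124.5, -114.1
LAT_MIN, LAT_MAX = 32.5, 42.0

# Select examples whose coordinates fall outside the expected range.
out_of_range = housing_df[
    (housing_df["longitude"] < LON_MIN)
    | (housing_df["longitude"] > LON_MAX)
    | (housing_df["latitude"] < LAT_MIN)
    | (housing_df["latitude"] > LAT_MAX)
]
print(out_of_range)
```

Similar sanity checks can be written for any feature with a known plausible range, such as non-negative room and population counts.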

Data Skew

Any sort of skew in your data, where certain groups or characteristics may be under- or over-represented relative to their real-world prevalence, can introduce bias into your model.

If you completed the Validation programming exercise, you may recall discovering how a failure to randomize the California housing data set prior to splitting it into training and validation sets resulted in a pronounced data skew. Figure 1 visualizes a subset of data drawn from the full data set that exclusively represents the northwest region of California.

Figure 1. California state map overlaid with data from the California Housing data set. Each dot represents a housing block, with colors ranging from blue to red corresponding to median house price ranging from low to high, respectively.

If this unrepresentative sample were used to train a model to predict California housing prices statewide, the lack of housing data from southern portions of California would be problematic. The geographical bias encoded in the model might adversely affect homebuyers in unrepresented communities.
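
One common safeguard against this particular failure mode is to shuffle the examples before splitting them into training and validation sets, as the Validation exercise demonstrates. A minimal sketch, again assuming the housing_df DataFrame from earlier:

```python
# Shuffle the rows so the split isn't ordered by geography or any other
# property of the original row order. random_state makes the shuffle repeatable.
shuffled_df = housing_df.sample(frac=1, random_state=17).reset_index(drop=True)

# Hold out the last 20% of the shuffled rows for validation.
split_index = int(len(shuffled_df) * 0.8)
train_df = shuffled_df.iloc[:split_index]
validation_df = shuffled_df.iloc[split_index:]
```

Note that shuffling only fixes skew introduced by the ordering of the data; it cannot compensate for regions or groups that are missing from the data set entirely.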