Which of the following models' predictions have been affected
by selection bias?
A German handwriting recognition smartphone app uses a model that frequently
misclassifies ß (Eszett) characters as B characters, because it was trained
on a corpus of American handwriting samples, mostly written in English.
This model was affected by a type of selection bias called
coverage bias: the training data (American English handwriting) was not
representative of the type of data provided by the model's target
audience (German handwriting).
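One simple way to spot this kind of coverage gap before deployment is to compare the labels present in the training data against the labels the target audience is expected to produce. The following is a minimal sketch, not the app's actual pipeline: the two label sets (`train_labels`, `target_labels`) are hypothetical stand-ins for an English-only corpus and a German alphabet.

```python
import string

# Hypothetical label set of an English-only training corpus.
train_labels = set(string.ascii_letters)

# Hypothetical label set expected from German-speaking users:
# the same Latin letters plus umlauts and Eszett.
target_labels = set(string.ascii_letters) | set("äöüÄÖÜß")

# Characters the target audience will write but the model never saw.
missing = sorted(target_labels - train_labels)

print("Characters missing from the training data:", missing)
```

Any character that appears in `missing` is one the model cannot have learned to recognize, so it will be mapped to whichever known character looks most similar, as with ß and B.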
Engineers built a model to predict the likelihood of a person
developing diabetes based on their daily food intake. The model
was trained on 10,000 "food diaries" collected from a randomly chosen
group of people worldwide representing a variety of different age groups, ethnic
backgrounds, and genders. However, when the model was deployed, it had very poor
accuracy. Engineers subsequently discovered that food diary participants
were reluctant to admit the true volume of unhealthy foods they ate,
and were more likely to document consumption of nutritious food
than less healthy snacks.
There is no selection bias in this model; participants who provided
training data were a representative sampling of users and were chosen randomly.
Instead, this model was affected by reporting bias: participants
reported eating unhealthy foods far less often than they actually did.
Engineers at a company developed a model to predict staff turnover rates
(the percentage of employees quitting their jobs each year) based on data
collected from a survey sent to all employees. After several years
of use, engineers determined that the model underestimated turnover by more
than 20%. When conducting exit interviews with employees leaving the company,
they learned that more than 80% of people who were dissatisfied with their jobs
chose not to complete the survey, compared to a company-wide opt-out rate of 15%.
This model was affected by a type of selection bias called non-response
bias. People who were dissatisfied with their jobs were underrepresented
in the training data set because they opted out of the company-wide survey
at much higher rates than the entire employee population.
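The effect of these opt-out rates can be worked through with a little arithmetic. The sketch below uses the two rates from the example (an 80% opt-out rate among dissatisfied employees, taken as a lower bound, and a 15% company-wide opt-out rate). The true share of dissatisfied employees is not given in the example, so the 10% used here is purely an assumption for illustration, as is the function name.

```python
def dissatisfied_share_among_respondents(true_share, optout_dissatisfied, optout_overall):
    """Return the fraction of survey respondents who are dissatisfied."""
    # Fraction of all employees who are dissatisfied AND completed the survey.
    responded_dissatisfied = true_share * (1.0 - optout_dissatisfied)
    # Fraction of all employees who completed the survey.
    responded_overall = 1.0 - optout_overall
    return responded_dissatisfied / responded_overall

true_share = 0.10  # assumed for illustration; not stated in the example
observed = dissatisfied_share_among_respondents(
    true_share,
    optout_dissatisfied=0.80,  # from the example ("more than 80%")
    optout_overall=0.15,       # from the example
)

print(f"True share of dissatisfied employees: {true_share:.1%}")
print(f"Share among survey respondents:       {observed:.1%}")
```

Under these assumptions, dissatisfied employees make up roughly a quarter of their true share in the survey data, so a model trained on the survey sees far fewer signals of dissatisfaction than actually exist.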
Engineers developing a movie-recommendation system hypothesized that
people who like horror movies will also like science-fiction movies. When
they trained a model on 50,000 users' watchlists, however, it showed no
such correlation between preferences for horror and for sci-fi;
instead it showed a strong correlation between preferences for horror
and for documentaries. This seemed odd to them, so they retrained the
model five more times using different hyperparameters. Their final
trained model showed a 70% correlation between preferences for horror
and for sci-fi, so they confidently released it into production.
There is no evidence of selection bias, but this model may have
instead been affected by experimenter's bias, as the engineers kept
iterating on their model until it confirmed their preexisting hypothesis.