Learning objectives
In this module, you will learn to:
Investigate potential issues underlying raw or processed datasets, including
collection and quality issues.
Identify biases, invalid inferences, and rationalizations.
Find common issues in data analysis, including correlation,
relatedness, and irrelevance.
Examine a chart for common problems, misperceptions, and
misleading display and design choices.
ML motivation
While not as glamorous as model architectures and other downstream model work,
data exploration, documentation, and preprocessing are critical to
ML work. ML practitioners can fall into what Nithya Sambasivan et al. called
"data cascades" in their 2021 ACM paper if they do not deeply understand:
the conditions under which their data is collected
the quality, characteristics, and limitations of the data
what the data can and can't show
It's very expensive to train models on bad data and discover the problems
only when the model produces low-quality outputs. Likewise, failing to grasp
the limitations of data or the human biases in collecting it, or mistaking
correlation for causation, can result in over-promising and under-delivering,
which can lead to a loss of trust.
This course walks through common but subtle data traps that ML and data
practitioners may encounter in their work.
Last updated 2024-07-26 UTC.