Numerical data: Conclusion

  • A machine learning model's predictive ability depends directly on the quality of the data it's trained on.

  • Numerical features often benefit from normalization or binning to improve model performance.

  • Data validation through verification tests and visualizations is crucial for identifying and addressing potential issues.

  • Understanding data distribution through statistics on both the entire dataset and its subsets is essential for identifying hidden problems.

  • Maintaining thorough documentation of all data transformations ensures reproducibility and facilitates model understanding.
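As a reminder of what normalization and binning look like in practice, here is a minimal sketch using only the Python standard library. The function names, the sample temperatures, and the bucket boundaries are assumptions made for illustration; Z-score scaling is only one of several normalization techniques, and the right boundaries depend on your data.

```python
import bisect
import statistics


def z_score_normalize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        # A constant feature has no spread to rescale; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mean) / stdev for v in values]


def bin_index(value, boundaries):
    """Return the bucket index for value, given ascending bucket boundaries.

    Bucket 0 holds values below boundaries[0]; the last bucket holds
    values at or above boundaries[-1].
    """
    return bisect.bisect_right(boundaries, value)


# Hypothetical feature values and bucket boundaries.
temperatures = [12.0, 18.5, 21.0, 25.5, 33.0]
normalized = z_score_normalize(temperatures)
buckets = [bin_index(t, [15.0, 20.0, 30.0]) for t in temperatures]
```

With these boundaries, the five temperatures land in buckets 0, 1, 2, 2, and 3 respectively.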

A machine learning (ML) model's health is determined by its data. Feed your model healthy data and it will thrive; feed your model junk and its predictions will be worthless.

Best practices for working with numerical data:

  • Remember that your ML model interacts with the data in the feature vector, not the data in the dataset.
  • Normalize most numerical features.
  • If your first normalization strategy doesn't succeed, consider a different way to normalize your data.
  • Binning, also referred to as bucketing, is sometimes better than normalizing.
  • Think about what your data should look like, and write verification tests to validate those expectations. For example:
    • The absolute value of latitude should never exceed 90. You can write a test to check whether any latitude value in your data is greater than 90 or less than -90.
    • If your data is restricted to the state of Florida, you can write a test to check that the latitudes fall between 24 and 31, inclusive.
  • Visualize your data with scatter plots and histograms. Look for anomalies.
  • Gather statistics not only on the entire dataset but also on smaller subsets of the dataset. That's because aggregate statistics sometimes obscure problems in smaller sections of a dataset.
  • Document all your data transformations.
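The latitude checks mentioned in the list above could be sketched roughly as follows. This is one possible shape for such a verification test, not a prescribed one: the helper name `find_invalid_latitudes` and the sample values are hypothetical, and the Florida bounds are the 24 to 31 range given above.

```python
def find_invalid_latitudes(latitudes, low=-90.0, high=90.0):
    """Return (index, value) pairs for latitudes outside [low, high].

    NaN values are also reported, because any comparison with NaN is False.
    """
    return [
        (i, lat)
        for i, lat in enumerate(latitudes)
        if not (low <= lat <= high)
    ]


# Hypothetical sample data: one impossible value and one non-Florida value.
latitudes = [25.76, 30.33, 40.71, 91.2, 28.54]

# Global sanity check: the absolute value of latitude must not exceed 90.
global_violations = find_invalid_latitudes(latitudes)

# Dataset-specific check: Florida latitudes should fall within 24 to 31.
florida_violations = find_invalid_latitudes(latitudes, low=24.0, high=31.0)
```

Returning the offending rows rather than a bare pass or fail makes it easier to inspect why a check failed. Here the global check flags only the value 91.2, while the Florida check also flags 40.71.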

Data is your most valuable resource, so treat it with care.

What's next

Congratulations on finishing this module!

We encourage you to explore the various MLCC modules at your own pace and in whatever order interests you. If you'd like to follow a recommended order, we suggest moving to the following module next: Representing categorical data.