Data Dependencies

Data is as important to ML developers as code is to traditional programmers. This lesson focuses on the kinds of questions you should be asking of your data.

Data Dependencies

  • Input data (features) determine ML system behavior.
    • We write unit tests for software libraries, but what about data?
  • Care is required when choosing input signals.
    • Maybe even more care than when deciding which software libraries to depend on?
  • Reliability
    • What happens when the signal is not available? Would you know?
  • Versioning
    • Does the system that computes this signal ever change? How often? What would happen?
  • Necessity
    • Does the usefulness of the signal justify the cost of including it?
  • Correlations
    • Are any of my input signals so tied together that we need additional strategies to tease them apart?
  • Feedback Loops
    • Which of my input signals may be impacted by my model's outputs?

Video Lecture Summary

The behavior of an ML system is dependent on the behavior and qualities of its input features. As the input data for those features changes, so too will your model. Sometimes that change is desirable, but sometimes it is not.

In traditional software development, you focus more on code than on data. In machine learning development, although coding is still part of the job, your focus must widen to include data. For example, on traditional software development projects, it is a best practice to write unit tests to validate your code. On ML projects, you must also continuously test, verify, and monitor your input data.

For example, you should continuously monitor your model so that you can remove unused (or little-used) features. Imagine a feature that contributes little or nothing to the model: if the input data for that feature abruptly changes, your model's behavior might also change abruptly, and in undesirable ways.
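A minimal sketch of this kind of monitoring: flag features whose contribution score falls below a threshold so they can be reviewed for removal. The feature names, scores, and threshold below are invented placeholders; in practice the scores might come from permutation importance or an ablation study.

```python
def flag_low_value_features(importances, threshold=0.01):
    """Return (sorted) names of features whose importance is below `threshold`."""
    return sorted(f for f, score in importances.items() if score < threshold)

# Hypothetical contribution scores for a model's input features.
importances = {
    "user_age": 0.21,
    "session_length": 0.12,
    "legacy_browser_flag": 0.002,  # contributes almost nothing
    "day_of_week": 0.008,
}

print(flag_low_value_features(importances))
```

Each flagged feature is a dependency you are paying to maintain without a corresponding benefit, and a place where an abrupt upstream change could still hurt you.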

Reliability

Some questions to ask about the reliability of your input data:

  • Is the signal always going to be available or is it coming from an unreliable source? For example:
    • Is the signal coming from a server that crashes under heavy load?
    • Is the signal coming from humans that go on vacation every August?
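One way to make "would you know?" concrete is a sketch like the following: alert when the fraction of missing values in a signal's batch exceeds a tolerance, so an upstream outage (a crashed server, annotators on vacation) is noticed rather than silently ingested. The batch data and threshold are illustrative.

```python
def missing_fraction(values):
    """Fraction of entries in `values` that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def check_signal(values, max_missing=0.05):
    """Return (ok, fraction_missing) for one batch of a signal."""
    frac = missing_fraction(values)
    return frac <= max_missing, frac

# A batch where 3 of 10 values failed to arrive.
batch = [3.2, None, 4.1, 4.0, None, None, 3.9, 4.2, 4.4, 4.1]
ok, frac = check_signal(batch)
print(ok, frac)
```

A check like this would run on every batch before training or serving, turning a silent data failure into an explicit alert.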

Versioning

Some questions to ask about versioning:

  • Does the system that computes this data ever change? If so:
    • How often?
    • How will you know when that system changes?

Sometimes, data comes from an upstream process. If that process changes abruptly, your model can suffer.

Consider creating your own copy of the data you receive from the upstream process. Then, only advance to the next version of the upstream data when you are certain that it is safe to do so.
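The pinning idea above can be sketched by snapshotting each upstream delivery under a content hash and training against a pinned hash until a new version has been vetted. The record structure here is invented; the point is that a silent upstream change becomes a visible hash mismatch.

```python
import hashlib
import json

def snapshot(records):
    """Content hash of a delivery, using deterministic JSON serialization."""
    payload = json.dumps(records, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

store = {}

# First delivery from the upstream process: snapshot it and pin the version.
delivery_v1 = [{"id": 1, "score": 0.8}]
pinned = snapshot(delivery_v1)
store[pinned] = delivery_v1

# Upstream silently changes how `score` is computed.
delivery_v2 = [{"id": 1, "score": 7.5}]
candidate = snapshot(delivery_v2)

# The mismatch is detected; keep training on `pinned` until v2 is vetted.
print(candidate != pinned)
```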

Necessity

The following question might remind you of regularization:

  • Does the usefulness of the feature justify the cost of including it?

It is always tempting to add more features to the model. For example, suppose you find a new feature whose addition makes your model slightly more accurate. More accuracy certainly sounds better than less accuracy. However, now you've just added to your maintenance burden. That additional feature could degrade unexpectedly, so you've got to monitor it. Think carefully before adding features that lead to minor short-term wins.
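That trade-off can be made explicit with a simple gate: accept a new feature only when its accuracy gain clears a minimum bar, reflecting the maintenance burden every extra dependency brings. The accuracy numbers and threshold below are illustrative, not a recommendation.

```python
def worth_adding(acc_without, acc_with, min_gain=0.005):
    """Accept a feature only if it improves accuracy by at least `min_gain`."""
    return (acc_with - acc_without) >= min_gain

print(worth_adding(0.910, 0.911))  # tiny win: not worth the upkeep
print(worth_adding(0.910, 0.930))  # clears the bar
```

A real version of this check would also weigh serving cost, latency, and the reliability of the feature's source, not just accuracy.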

Correlations

Some features correlate (positively or negatively) with other features. Ask yourself the following question:

  • Are any features so tied together that you need additional strategies to tease them apart?
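A quick way to surface such pairs is to compute pairwise Pearson correlations between feature columns and flag the ones that move almost in lockstep. The feature data below is invented; two of the columns are the same measurement in different units, so they correlate perfectly.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59, 63, 67, 71, 75],  # same signal in different units
    "shoe_size": [38, 36, 43, 41, 44],
}

names = list(features)
flagged = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if abs(pearson(features[a], features[b])) > 0.95
]
print(flagged)
```

Highly correlated pairs are candidates for dropping one member, combining them, or applying a decorrelation step before training.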

Feedback Loops

Sometimes a model can affect its own training data. For example, a model's outputs can feed back, directly or indirectly, as input features to that same model.

Sometimes a model can affect another model. For example, consider two models for predicting stock prices:

  • Model A, which is a bad predictive model.
  • Model B, which uses the price of Stock X as an input feature.

Since Model A is buggy, it mistakenly decides to buy shares of Stock X. Those purchases drive up the price of Stock X. Because Model B uses the price of Stock X as an input feature, it can easily come to false conclusions about the value of Stock X. Model B could, therefore, buy or sell shares of Stock X based on the buggy behavior of Model A. Model B's behavior, in turn, can affect Model A, possibly triggering a tulip mania or a slide in Stock X's price.