Data is as important to ML developers as code is to traditional programmers. This lesson focuses on the kinds of questions you should be asking of your data.
Video Lecture Summary
The behavior of an ML system depends on the behavior and qualities of its input features. As the input data for those features changes, so will your model's behavior. Sometimes that change is desirable, but sometimes it is not.
In traditional software development, you focus more on code than on data. In machine learning development, although coding is still part of the job, your focus must widen to include data. For example, on traditional software development projects, it is a best practice to write unit tests to validate your code. On ML projects, you must also continuously test, verify, and monitor your input data.
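As a minimal sketch of what such input-data checks might look like, assuming the data arrives as a pandas DataFrame (the feature names and expected ranges below are hypothetical placeholders, not anything from this lesson):

```python
import pandas as pd

# Hypothetical expectations about each input feature; in practice these
# would live in a schema maintained alongside the model.
EXPECTED_RANGES = {
    "age": (0, 120),
    "median_income": (0, 1_000_000),
}

def validate_batch(batch: pd.DataFrame) -> list[str]:
    """Returns a list of problems found in one batch of input data."""
    problems = []
    for column, (low, high) in EXPECTED_RANGES.items():
        if column not in batch.columns:
            problems.append(f"missing feature: {column}")
            continue
        if batch[column].isna().any():
            problems.append(f"null values in feature: {column}")
        out_of_range = batch[(batch[column] < low) | (batch[column] > high)]
        if len(out_of_range) > 0:
            problems.append(f"{len(out_of_range)} out-of-range values in {column}")
    return problems
```

Run a check like this on every new batch, from a unit test or a scheduled job, and alert when it reports problems.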
In particular, you should monitor your model to identify unused (or little-used) features and consider removing them. Imagine a feature that has been contributing little or nothing to the model. If the input data for that feature changes abruptly, your model's behavior might also change abruptly, in undesirable ways.
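One way to find such features (a sketch only; permutation importance, the validation DataFrame, and the threshold are illustrative assumptions rather than something this lesson prescribes) is to measure how much each feature actually contributes to validation performance:

```python
from sklearn.inspection import permutation_importance

def low_value_features(model, X_valid, y_valid, threshold=1e-4):
    """Flags features whose removal barely hurts validation performance."""
    result = permutation_importance(
        model, X_valid, y_valid, n_repeats=5, random_state=0)
    # Assumes X_valid is a pandas DataFrame so feature names are available.
    return [name
            for name, score in zip(X_valid.columns, result.importances_mean)
            if score < threshold]

# Features returned here contribute little to the model; if their upstream
# data later changes abruptly, they add risk without adding value, so they
# are candidates for removal.
```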
Reliability
Some questions to ask about the reliability of your input data (a defensive fallback is sketched after the list):
- Is the signal always going to be available or is it coming from an unreliable source? For example:
- Is the signal coming from a server that crashes under heavy load?
- Is the signal coming from humans who go on vacation every August?
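When a signal does come from an unreliable source, one defensive pattern is to catch the failure and substitute a well-defined fallback. In this sketch, `client.get_click_rate()` and the fallback value are hypothetical placeholders for whatever your system actually calls:

```python
import logging

FALLBACK_CLICK_RATE = 0.0  # Hypothetical neutral default for the feature.

def fetch_click_rate(client) -> float:
    """Reads a feature from a flaky upstream service, with a safe fallback."""
    try:
        return client.get_click_rate()
    except Exception:
        # The upstream server may be down or overloaded; log the failure and
        # fall back so the model still receives a well-defined value.
        logging.warning("click_rate unavailable; using fallback value")
        return FALLBACK_CLICK_RATE
```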
Versioning
Some questions to ask about versioning:
- Does the system that computes this data ever change? If so:
- How often?
- How will you know when that system changes?
Sometimes, data comes from an upstream process. If that process changes abruptly, your model can suffer.
Consider creating your own copy of the data you receive from the upstream process. Then, only advance to the next version of the upstream data when you are certain that it is safe to do so.
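A minimal sketch of that idea, assuming the upstream export lands as a file (all paths and filenames below are hypothetical):

```python
import shutil
from datetime import date
from pathlib import Path

UPSTREAM_EXPORT = Path("/data/upstream/latest.csv")  # hypothetical upstream drop
SNAPSHOT_DIR = Path("/data/snapshots")

def snapshot_upstream() -> Path:
    """Copies today's upstream export into an immutable, dated snapshot."""
    SNAPSHOT_DIR.mkdir(parents=True, exist_ok=True)
    destination = SNAPSHOT_DIR / f"{date.today():%Y%m%d}.csv"
    shutil.copy2(UPSTREAM_EXPORT, destination)
    return destination

# Training jobs read from a pinned snapshot rather than from "latest", and
# the pin moves forward only after the new version has been validated.
PINNED_VERSION = SNAPSHOT_DIR / "20240101.csv"  # hypothetical vetted version
```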
Necessity
The following question might remind you of regularization:
- Does the usefulness of the feature justify the cost of including it?
It is always tempting to add more features to the model. For example, suppose you find a new feature whose addition makes your model slightly more accurate. More accuracy certainly sounds better than less accuracy. However, you have also added to your maintenance burden: that additional feature could degrade unexpectedly, so you must monitor it. Think carefully before adding features that yield only minor short-term wins.
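One lightweight way to ground that decision is an ablation comparison: train with and without the candidate feature and see whether the gain justifies the ongoing cost. This is only a sketch; the logistic-regression model, the 5-fold cross-validation, and the DataFrame input are illustrative assumptions:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def gain_from_feature(X, y, candidate: str) -> float:
    """Estimates how much validation accuracy the candidate feature adds."""
    with_feature = cross_val_score(
        LogisticRegression(max_iter=1000), X, y, cv=5).mean()
    without_feature = cross_val_score(
        LogisticRegression(max_iter=1000), X.drop(columns=[candidate]), y, cv=5).mean()
    return with_feature - without_feature

# If the gain is tiny, the monitoring and maintenance cost of the feature
# may outweigh the benefit of keeping it.
```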
Correlations
Some features correlate (positively or negatively) with other features. Ask yourself the following question:
- Are any features so tied together that you need additional strategies to tease them apart?
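A quick way to surface such pairs, assuming the features live in a pandas DataFrame (the 0.9 threshold is an illustrative choice), is to scan the correlation matrix:

```python
import pandas as pd

def highly_correlated_pairs(features: pd.DataFrame, threshold: float = 0.9):
    """Lists feature pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = features.corr().abs()
    columns = corr.columns
    pairs = []
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            if corr.iloc[i, j] > threshold:
                pairs.append((columns[i], columns[j], corr.iloc[i, j]))
    return pairs

# Pairs returned here may need extra handling, such as dropping one feature
# or combining the two, so the model does not lean on redundant signals.
```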
Feedback Loops
Sometimes a model can affect its own training data. For example, some models' predictions feed back, directly or indirectly, as input features to those same models.
Sometimes a model can affect another model. For example, consider two models for predicting stock prices:
- Model A, which is a bad predictive model.
- Model B.
Because Model A is buggy, it mistakenly decides to buy shares of Stock X. Those purchases drive up the price of Stock X. Model B uses the price of Stock X as an input feature, so it can easily come to false conclusions about the value of Stock X. Model B could, therefore, buy or sell shares of Stock X based on the buggy behavior of Model A. Model B's behavior, in turn, can affect Model A, possibly triggering a tulip mania or a slide in the price of Stock X.