Let's say you're working on an advertising-related machine learning model and want to predict advertiser spending for January. You have limits on the amount of data you can store on disk, so you must use only a subset of available data. You could use all of the most recent data, which is from the prior month of December. Someone else suggests you sample data throughout the last year. Which might be better and why?
Data from the previous month (December)
While this data is more recent, it may be influenced by seasonal effects of advertiser spending before the December holidays, and those effects may not carry over into January.
Data sampled throughout the year
While some of this data is older, it's less likely to be dominated by seasonal effects of advertiser spending before the December holidays.
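One way to sample throughout the year while staying within a storage limit is to draw an equal number of records from each month. The following is a minimal sketch, not a prescribed method: the record layout (a `month` field and a `spend` field), the `sample_across_months` helper, and the budget of 120 records are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def sample_across_months(records, budget, seed=0):
    """Draw up to `budget` records, split evenly across the months present."""
    rng = random.Random(seed)

    # Group records by their month (hypothetical "month" field, 1-12).
    by_month = defaultdict(list)
    for rec in records:
        by_month[rec["month"]].append(rec)
    if not by_month:
        return []

    # Give every month the same share of the storage budget.
    per_month = budget // len(by_month)

    sample = []
    for month in sorted(by_month):
        rows = by_month[month]
        # Never ask for more rows than the month actually has.
        sample.extend(rng.sample(rows, min(per_month, len(rows))))
    return sample


# Hypothetical data: 100 spending records for each month of the year.
records = [
    {"month": m, "spend": 100.0 + m}
    for m in range(1, 13)
    for _ in range(100)
]

sample = sample_across_months(records, budget=120)
months_covered = sorted({rec["month"] for rec in sample})
```

With this layout, each month contributes the same number of records, so December's holiday spending cannot dominate the training set. Whether equal shares per month are appropriate depends on the problem; weighting recent months more heavily is another reasonable choice.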
You want to show videos that users want to watch. You use videos they've viewed on YouTube as a label. Is this label direct or derived?
This label is derived because it's not the exact prediction you want to make. Perhaps the user opened the video but closed it shortly afterwards. This event would count as a view even though the user didn't watch the video. In some cases, a heuristic like this might be your only option, but be aware of your label type (direct or derived) and how it limits your predictions.
While that label might result in an accurate prediction much of the time, it is not the exact prediction you want to make.
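The gap between a "viewed" label and what you actually want to predict can be made concrete with a small sketch. Everything here is an assumption for illustration: the event fields (`watch_seconds`, `video_seconds`), the two helper functions, and the 50% watch threshold. Note that the stricter label is itself still a derived label, just a closer proxy.

```python
def viewed_label(event):
    """Derived label: 1 if the user opened the video at all."""
    return 1 if event["watch_seconds"] > 0 else 0


def watched_label(event, min_fraction=0.5):
    """Stricter derived label: 1 if the user watched at least `min_fraction`
    of the video. The 0.5 default is an arbitrary, illustrative threshold."""
    fraction = event["watch_seconds"] / event["video_seconds"]
    return 1 if fraction >= min_fraction else 0


# Hypothetical viewing events for two 10-minute videos.
events = [
    {"video_id": "a", "watch_seconds": 5, "video_seconds": 600},    # opened, then closed
    {"video_id": "b", "watch_seconds": 540, "video_seconds": 600},  # watched most of it
]

view_labels = [viewed_label(e) for e in events]    # [1, 1]
watch_labels = [watched_label(e) for e in events]  # [0, 1]
```

Both events count as views, but only the second suggests the user wanted to watch the video, which is the kind of mismatch a derived label can introduce.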