# Representation: Qualities of Good Features

Stay organized with collections Save and categorize content based on your preferences.

We've explored ways to map raw data into suitable feature vectors, but that's only part of the work. We must now explore what kinds of values actually make good features within those feature vectors.

### Avoid rarely used discrete feature values

Good feature values should appear more than 5 or so times in a data set. Doing so enables a model to learn how this feature value relates to the label. That is, having many examples with the same discrete value gives the model a chance to see the feature in different settings, and in turn, determine when it's a good predictor for the label. For example, a house_type feature would likely contain many examples in which its value was victorian:

✔This is a good
example:house_type: victorian


Conversely, if a feature's value appears only once or very rarely, the model can't make predictions based on that feature. For example, unique_house_id is a bad feature because each value would be used only once, so the model couldn't learn anything from it:

The following is an example of a unique value. This should
be avoided.✘unique_house_id: 8SK982ZZ1242Z


### Prefer clear and obvious meanings

Each feature should have a clear and obvious meaning to anyone on the project. For example, the following good feature is clearly named and the value makes sense with respect to the name:

 ✔The meaning of the following value is clear from the label and
value.house_age_years: 27 

Conversely, the meaning of the following feature value is pretty much indecipherable to anyone but the engineer who created it:

✘The following is an
example of a value that is unclear. This should be avoidedhouse_age: 851472000


In some cases, noisy data (rather than bad engineering choices) causes unclear values. For example, the following user_age_years came from a source that didn't check for appropriate values:

✘The following is an example of noisy/bad data. This should be avoided.user_age_years: 277


### Don't mix "magic" values with actual data

Good floating-point features don't contain peculiar out-of-range discontinuities or "magic" values. For example, suppose a feature holds a floating-point value between 0 and 1. So, values like the following are fine:

✔The following is a
good example:quality_rating: 0.82
quality_rating: 0.37


However, if a user didn't enter a quality_rating, perhaps the data set represented its absence with a magic value like the following:

✘The following is an
example of a magic value. This should be avoided.quality_rating: -1


To explicitly mark magic values, create a Boolean feature that indicates whether or not a quality_rating was supplied. Give this Boolean feature a name like is_quality_rating_defined.

In the original feature, replace the magic values as follows:

• For variables that take a finite set of values (discrete variables), add a new value to the set and use it to signify that the feature value is missing.
• For continuous variables, ensure missing values do not affect the model by using the mean value of the feature's data.

### Account for upstream instability

The definition of a feature shouldn't change over time. For example, the following value is useful because the city name probably won't change. (Note that we'll still need to convert a string like "br/sao_paulo" to a one-hot vector.)

✔This is a good
example:city_id: "br/sao_paulo"


But gathering a value inferred by another model carries additional costs. Perhaps the value "219" currently represents Sao Paulo, but that representation could easily change on a future run of the other model:

✘The following is an
example of a value that could change. This should be avoided.inferred_city_cluster: "219"

[{ "type": "thumb-down", "id": "missingTheInformationINeed", "label":"Missing the information I need" },{ "type": "thumb-down", "id": "tooComplicatedTooManySteps", "label":"Too complicated / too many steps" },{ "type": "thumb-down", "id": "outOfDate", "label":"Out of date" },{ "type": "thumb-down", "id": "samplesCodeIssue", "label":"Samples / code issue" },{ "type": "thumb-down", "id": "otherDown", "label":"Other" }]
[{ "type": "thumb-up", "id": "easyToUnderstand", "label":"Easy to understand" },{ "type": "thumb-up", "id": "solvedMyProblem", "label":"Solved my problem" },{ "type": "thumb-up", "id": "otherUp", "label":"Other" }]