Production ML systems: Static versus dynamic inference
Inference is the process of
making predictions by applying a trained model to
unlabeled examples.
Broadly speaking, a model can infer predictions in one of two ways:
Static inference (also called offline inference or
batch inference) means the model makes predictions on a batch of
common unlabeled examples
and then caches those predictions somewhere.
Dynamic inference (also called online inference or real-time
inference) means that the model only makes predictions on demand,
for example, when a client requests a prediction.
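The two serving modes can be contrasted in a few lines of Python. This is only a minimal sketch: `toy_model`, `build_prediction_cache`, `serve_static`, and `serve_dynamic` are hypothetical names chosen for illustration, and the "model" is a plain function standing in for a trained model.

```python
from typing import Callable, Dict, Iterable, Optional


def toy_model(x: int) -> int:
    """Hypothetical stand-in for a trained model (assumption, not a real model)."""
    return x * x


def build_prediction_cache(
    model: Callable[[int], int], common_inputs: Iterable[int]
) -> Dict[int, int]:
    """Static inference: predict ahead of time for common inputs and cache the results."""
    return {x: model(x) for x in common_inputs}


def serve_static(cache: Dict[int, int], x: int) -> Optional[int]:
    """Serve only from the cache; returns None for inputs that were never precomputed."""
    return cache.get(x)


def serve_dynamic(model: Callable[[int], int], x: int) -> int:
    """Dynamic inference: run the model at request time."""
    return model(x)


cache = build_prediction_cache(toy_model, range(5))
print(serve_static(cache, 3))         # cached input
print(serve_static(cache, 100))       # uncommon input: not in the cache
print(serve_dynamic(toy_model, 100))  # computed on demand
```

In this sketch the static path never calls the model at request time, which is why it can't answer for an input that was left out of the cache.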
To use an extreme example, imagine a very complex model that
takes one hour to infer a prediction.
This would probably be an excellent situation for static inference, because
predictions can be computed ahead of time and served from a cache.
Suppose this same complex model mistakenly uses dynamic inference instead of
static inference. If many clients request predictions around the same time,
most of them won't receive that prediction for hours or days.
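The waiting time in this scenario can be estimated with simple arithmetic. The sketch below assumes that all requests arrive at once, that each server handles one request at a time, and that nothing is batched; the function name and its parameters are hypothetical.

```python
import math


def worst_case_wait_hours(
    num_requests: int, hours_per_prediction: float, num_servers: int = 1
) -> float:
    """Time until the last simultaneous request is answered.

    Assumes each server processes its requests strictly one after another.
    """
    # Requests are handled in "rounds"; each round serves one request per server.
    rounds = math.ceil(num_requests / num_servers)
    return rounds * hours_per_prediction


# 48 simultaneous requests, one hour per prediction, a single server:
print(worst_case_wait_hours(48, 1.0))                  # 48.0 hours, that is, two days
# The same load spread over 4 servers:
print(worst_case_wait_hours(48, 1.0, num_servers=4))   # 12.0 hours
```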
Now consider a model that infers quickly, perhaps in 2 milliseconds, using
relatively few computational resources. In this situation, clients can
receive predictions quickly and efficiently through dynamic inference, as
suggested in Figure 5.
Static inference
Static inference offers certain advantages and disadvantages.
Advantages
Don't need to worry much about the cost of inference.
Can verify predictions before serving them.
Disadvantages
Can only serve cached predictions, so the system might not be
able to serve predictions for uncommon input examples.
Update latency is likely measured in hours or days.
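One way to gauge the first disadvantage is to estimate what fraction of real requests a cache of a given size would cover. The following is a rough sketch under the assumption that a request log is available and that the cache holds predictions for the most frequently requested inputs; `cache_coverage` and the sample log are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cache_coverage(request_log: Sequence[Hashable], cache_size: int) -> float:
    """Fraction of logged requests whose input is among the `cache_size` most common."""
    counts = Counter(request_log)
    cached_inputs = {item for item, _ in counts.most_common(cache_size)}
    hits = sum(1 for request in request_log if request in cached_inputs)
    return hits / len(request_log)


# Toy log: two popular inputs and a short tail of rare ones.
log = ["a"] * 5 + ["b"] * 3 + ["c", "d"]
print(cache_coverage(log, 2))  # 0.8: the two rare inputs are not served
```

The longer the tail of rare inputs, the lower the coverage a fixed-size cache can reach.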
Dynamic inference
Dynamic inference offers certain advantages and disadvantages.
Advantages
Can infer a prediction on any new item as it comes in, which
is great for long tail (less common) predictions.
Disadvantages
Compute intensive and latency sensitive. This combination may limit model
complexity; that is, you might have to build a simpler model that can
infer predictions more quickly than a complex model could.
Monitoring needs are more intensive.
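Because dynamic inference is latency sensitive, serving code often records how long each prediction takes and tracks a high percentile of those timings. The sketch below is one possible approach, not a prescribed one: `timed_predict` and `percentile` are hypothetical helpers, and the percentile uses the simple nearest-rank definition.

```python
import math
import time
from typing import Callable, List, Sequence


def timed_predict(
    model: Callable[[float], float], x: float, latencies_ms: List[float]
) -> float:
    """Run the model on demand and record the call's latency in milliseconds."""
    start = time.perf_counter()
    prediction = model(x)
    latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return prediction


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sequence (pct between 1 and 100)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


latencies: List[float] = []
for value in range(100):
    timed_predict(lambda v: v * 2.0, float(value), latencies)

# Tail latency is usually more informative than the average for on-demand serving.
print(f"p95 latency: {percentile(latencies, 95):.4f} ms")
```

If the measured tail latency exceeds what clients can tolerate, that is a signal that a simpler, faster model might be needed.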
Exercises: Check your understanding
Which three of the following four statements are
true of static inference?
The model must create predictions for all possible inputs.
Yes, the model must make predictions for all possible inputs and
store them in a cache or lookup table.
If the set of things that the model is predicting is limited, then
static inference might be a good choice.
However, for free-form inputs like user queries that have a long
tail of unusual or rare items, static inference can't provide
full coverage.
The system can verify inferred predictions before serving
them.
Yes, this is a useful aspect of static inference.
For a given input, the model can serve a prediction more quickly
than dynamic inference.
Yes, static inference can almost always serve predictions faster
than dynamic inference.
You can react quickly to changes in the world.
No, this is a disadvantage of static inference.
Which one of the following statements is
true of dynamic inference?
You can provide predictions for all possible items.
Yes, this is a strength of dynamic inference. Any request that
comes in will be given a score. Dynamic inference handles long-tail
distributions (those with many rare items), like the space of all
possible sentences written in movie reviews.
You can do post-verification of predictions before they
are used.
In general, it's not possible to do a post-verification of all
predictions before they get used because predictions are being
made on demand. You can, however, monitor
aggregate prediction quality to provide some level of
quality checking, although such monitoring sounds the fire alarm only after
the fire has already spread.
When performing dynamic inference, you don't need to worry
about prediction latency (the lag time for returning predictions)
as much as when performing static inference.
Prediction latency is often a real concern in dynamic inference.
Unfortunately, you can't necessarily fix prediction latency issues
by adding more inference servers.