Deep neural network models

Page Summary

Deep Neural Networks (DNNs) for recommendation address limitations of matrix factorization by incorporating side features and improving relevance.
Softmax DNN treats recommendation as a multiclass prediction problem, predicting the probability of user interaction with each item.
DNNs learn embeddings for both queries and items, using a nonlinear function to map features to embeddings.
Two-tower neural networks further enhance DNN models by using separate networks to learn embeddings for queries and items based on their features, enabling the use of item features for improved recommendations.

The previous section showed you how to use matrix factorization to learn embeddings. Some limitations of matrix factorization include:

The difficulty of using side features (that is, any features beyond the query ID/item ID). As a result, the model can only be queried with a user or item present in the training set.
Relevance of recommendations. Popular items tend to be recommended for everyone, especially when using dot product as a similarity measure. It is better to capture specific user interests.

Deep neural network (DNN) models can address these limitations of matrix factorization. DNNs can easily incorporate query features and item features (due to the flexibility of the input layer of the network), which can help capture the specific interests of a user and improve the relevance of recommendations.

Softmax DNN for recommendation

One possible DNN model is softmax, which treats the problem as a multiclass prediction problem in which:

The input is the user query.
The output is a probability vector with size equal to the number of items in the corpus, representing the probability of interacting with each item; for example, the probability of clicking on or watching a YouTube video.

Input

The input to a DNN can include:

dense features (for example, watch time and time since last watch)
sparse features (for example, watch history and country)

Unlike the matrix factorization approach, you can add side features such as age or country. We'll denote the input vector by \(x\).

Image highlighting the input layer in a softmax deep neural network — **Figure 1. The input layer, x.**
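
As a minimal sketch of how dense and sparse features might be combined into a single input vector \(x\), the following example concatenates two dense values with a multi-hot watch history and a one-hot country. The vocabulary sizes and the helper names (`multi_hot`, `build_input`) are hypothetical and chosen only for illustration; production systems typically pass sparse features through learned embedding layers rather than raw multi-hot vectors.

```python
# Hypothetical vocabularies; a real system would derive these from its data.
NUM_VIDEOS = 5
COUNTRIES = ["CA", "FR", "US"]


def multi_hot(indices, size):
    """Return a 0/1 vector of length `size` with ones at `indices`."""
    vec = [0.0] * size
    for i in indices:
        vec[i] = 1.0
    return vec


def build_input(watch_time_hours, hours_since_last_watch, watched_ids, country):
    """Concatenate dense features with encoded sparse features into x."""
    dense = [watch_time_hours, hours_since_last_watch]
    history = multi_hot(watched_ids, NUM_VIDEOS)  # sparse: watch history
    country_vec = multi_hot([COUNTRIES.index(country)], len(COUNTRIES))  # sparse: country
    return dense + history + country_vec


x = build_input(1.5, 12.0, watched_ids=[0, 3], country="FR")
print(x)
```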

Model architecture

The model architecture determines the complexity and expressivity of the model. By adding hidden layers and non-linear activation functions (for example, ReLU), the model can capture more complex relationships in the data. However, increasing the number of parameters also typically makes the model harder to train and more expensive to serve. We will denote the output of the last hidden layer by \(\psi (x) \in \mathbb R^d\).

Image highlighting the hidden layers in a softmax deep neural network — **Figure 2. The output of the hidden layers, \(\psi (x)\).**
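
The hidden layers can be sketched as a stack of affine maps, each followed by a ReLU. The layer sizes, the random initialization, and the function names below are assumptions made for illustration; the weights are untrained, so the output only demonstrates the shape of \(\psi(x)\), not a meaningful embedding.

```python
import random


def relu(v):
    return [max(0.0, a) for a in v]


def dense_layer(v, weights, bias):
    """Compute v @ weights + bias for a row vector v (weights is in_dim x out_dim)."""
    return [
        sum(v[i] * weights[i][j] for i in range(len(v))) + bias[j]
        for j in range(len(bias))
    ]


def init_layer(in_dim, out_dim, rng):
    """Random (untrained) weights and zero biases, for illustration only."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(out_dim)] for _ in range(in_dim)]
    return weights, [0.0] * out_dim


def psi(x, layers):
    """Apply each hidden layer followed by ReLU; the result is psi(x)."""
    h = x
    for weights, bias in layers:
        h = relu(dense_layer(h, weights, bias))
    return h


rng = random.Random(0)
layers = [init_layer(10, 8, rng), init_layer(8, 4, rng)]  # embedding size d = 4
x = [1.5, 12.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
query_embedding = psi(x, layers)
print(len(query_embedding))
```

Adding an entry to `layers` deepens the network; widening a layer increases its parameter count, which is the trade-off between expressivity and cost described above.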

Softmax Output: Predicted Probability Distribution

The model maps the output of the last layer, \(\psi (x)\), through a softmax layer to a probability distribution \(\hat p = h(\psi(x) V^T)\), where:

\(h : \mathbb R^n \to \mathbb R^n\) is the softmax function, given by \(h(y)_i=\frac{e^{y_i}}{\sum_j e^{y_j}}\)
\(V \in \mathbb R^{n \times d}\) is the matrix of weights of the softmax layer, where \(n\) is the number of items in the corpus.

The softmax layer maps a vector of scores \(y \in \mathbb R^n\) (sometimes called the logits) to a probability distribution.

Image showing a predicted probability distribution in a softmax deep neural network — **Figure 3. The predicted probability distribution, \(\hat p = h(\psi(x) V^T)\).**
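
The softmax layer can be sketched in a few lines. The example below computes the logits \(\psi(x) V^T\) as one dot product per row of \(V\), then applies \(h\). Subtracting the maximum logit before exponentiating is a common numerical-stability trick that leaves the result unchanged; the toy values for `psi_x` and `V` are made up.

```python
import math


def softmax(y):
    """h(y)_i = exp(y_i) / sum_j exp(y_j), shifted by max(y) for numerical stability."""
    m = max(y)
    exps = [math.exp(v - m) for v in y]
    total = sum(exps)
    return [e / total for e in exps]


def logits(psi_x, V):
    """Scores y = psi(x) V^T: one dot product per row V_j of V (n x d)."""
    return [sum(a * b for a, b in zip(psi_x, row)) for row in V]


psi_x = [1.0, 2.0]                        # d = 2
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # n = 3 items
p_hat = softmax(logits(psi_x, V))
print(p_hat)
```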

Loss Function

Finally, define a loss function that compares the following:

\(\hat p\), the output of the softmax layer (a probability distribution)
\(p\), the ground truth, representing the items the user has interacted with (for example, YouTube videos the user clicked or watched). This can be represented as a normalized multi-hot distribution (a probability vector).

For example, you can use the cross-entropy loss since you are comparing two probability distributions.

Image showing the loss function in a softmax deep neural network — **Figure 4. The loss function.**
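
To make the comparison concrete, here is a small sketch of the cross-entropy between a normalized multi-hot target \(p\) and a prediction \(\hat p\). The helper names and the `eps` guard against \(\log 0\) are illustrative choices, and the item IDs are assumed to be distinct.

```python
import math


def normalized_multi_hot(item_ids, num_items):
    """Put equal mass on each interacted item so the vector sums to 1."""
    p = [0.0] * num_items
    for j in item_ids:
        p[j] = 1.0 / len(item_ids)
    return p


def cross_entropy(p, p_hat, eps=1e-12):
    """-sum_j p_j * log(p_hat_j); eps guards against log(0)."""
    return -sum(pj * math.log(qj + eps) for pj, qj in zip(p, p_hat))


p = normalized_multi_hot([0, 2], num_items=4)       # user interacted with items 0 and 2
good = cross_entropy(p, [0.45, 0.05, 0.45, 0.05])   # mass on the right items
bad = cross_entropy(p, [0.05, 0.45, 0.05, 0.45])    # mass on the wrong items
print(good < bad)
```

A prediction that places more probability on the items the user actually interacted with yields a lower loss, which is what training minimizes.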

Softmax Embeddings

The probability of item \(j\) is given by \(\hat p_j = \frac{\exp(\langle \psi(x), V_j\rangle)}{Z}\), where \(Z\) is a normalization constant that does not depend on \(j\).

In other words, \(\log(\hat p_j) = \langle \psi(x), V_j\rangle - \log(Z)\), so the log probability of an item \(j\) is (up to an additive constant) the dot product of two \(d\)-dimensional vectors, which can be interpreted as query and item embeddings:

\(\psi(x) \in \mathbb R^d\) is the output of the last hidden layer. We call it the embedding of the query \(x\).
\(V_j \in \mathbb R^d\) is the vector of weights connecting the last hidden layer to output \(j\). We call it the embedding of item \(j\).

Image showing embeddings in a softmax deep neural network — **Figure 5. Embedding of item \(j\), \(V_j \in \mathbb R^d\)**
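
The identity \(\log(\hat p_j) = \langle \psi(x), V_j\rangle - \log(Z)\) can be checked numerically. The sketch below uses made-up values for the query embedding and item embeddings. It also illustrates one consequence: because \(\log(Z)\) is the same for every item, ranking items by dot product gives the same order as ranking them by probability.

```python
import math

psi_x = [0.5, -1.0, 2.0]  # query embedding, d = 3 (made-up values)
V = [[1.0, 0.0, 0.5], [0.0, 1.0, 1.0], [-1.0, 0.5, 0.0], [0.2, 0.2, 0.2]]

# Dot product <psi(x), V_j> for each item j.
scores = [sum(a * b for a, b in zip(psi_x, v_j)) for v_j in V]
Z = sum(math.exp(s) for s in scores)
log_p_hat = [math.log(math.exp(s) / Z) for s in scores]

# log(p_hat_j) + log(Z) recovers the dot product for every item j.
recovered = [lp + math.log(Z) for lp in log_p_hat]

# Sorting by score or by probability yields the same item order.
by_score = sorted(range(len(V)), key=lambda j: -scores[j])
by_prob = sorted(range(len(V)), key=lambda j: -log_p_hat[j])
print(by_score == by_prob)
```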

DNN and Matrix Factorization

In both the softmax model and the matrix factorization model, the system learns one embedding vector \(V_j\) per item \(j\). What we called the item embedding matrix \(V \in \mathbb R^{n \times d}\) in matrix factorization is now the matrix of weights of the softmax layer.

The query embeddings, however, are different. Instead of learning one embedding \(U_i\) per query \(i\), the system learns a mapping from the query features \(x\) to an embedding \(\psi(x) \in \mathbb R^d\). Therefore, you can think of this DNN model as a generalization of matrix factorization, in which you replace the query side with a nonlinear function \(\psi(\cdot)\).
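
One way to see the generalization is to consider a special case: if the only input feature is a one-hot query ID and \(\psi\) is a single linear layer with no hidden nonlinearity, then \(\psi(x)\) simply selects one row of the weight matrix, which plays the role of \(U_i\). The sketch below illustrates this with a made-up matrix `U`; the function names are hypothetical.

```python
def one_hot(i, size):
    return [1.0 if k == i else 0.0 for k in range(size)]


def linear_psi(x, U):
    """psi(x) = x U with no hidden nonlinearity; U is m x d."""
    d = len(U[0])
    return [sum(x[i] * U[i][k] for i in range(len(x))) for k in range(d)]


U = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # m = 3 queries, d = 2
embedding = linear_psi(one_hot(1, 3), U)
print(embedding == U[1])
```

Adding side features to \(x\) and hidden layers to \(\psi\) moves away from this special case toward the general DNN model.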

Can You Use Item Features?

Can you apply the same idea to the item side? That is, instead of learning one embedding per item, can the model learn a nonlinear function that maps item features to an embedding? Yes. To do so, use a two-tower neural network, which consists of two neural networks:

One neural network maps query features \(x_{\text{query}}\) to query embedding \(\psi(x_{\text{query}}) \in \mathbb R^d\)
One neural network maps item features \(x_{\text{item}}\) to item embedding \(\phi(x_{\text{item}}) \in \mathbb R^d\)

The output of the model can be defined as the dot product \(\langle \psi(x_{\text{query}}), \phi(x_{\text{item}}) \rangle\). Note that this is not a softmax model anymore. The new model predicts one value per pair \((x_{\text{query}}, x_{\text{item}})\) instead of a probability vector for each query \(x_{\text{query}}\).
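
A two-tower model can be sketched as two small networks that share nothing except the output dimension \(d\). The feature sizes, layer widths, and random (untrained) weights below are assumptions for illustration only. This sketch leaves the final layer of each tower linear so that embeddings can take negative values, which is one possible design choice rather than a requirement.

```python
import random


def tower(x, layers):
    """Small MLP: ReLU between layers, linear final layer."""
    h = x
    for k, (weights, bias) in enumerate(layers):
        h = [
            sum(h[i] * weights[i][j] for i in range(len(h))) + bias[j]
            for j in range(len(bias))
        ]
        if k < len(layers) - 1:
            h = [max(0.0, a) for a in h]
    return h


def init_layers(dims, rng):
    """Random (untrained) weights for consecutive layer sizes in `dims`."""
    return [
        (
            [[rng.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_in)],
            [0.0] * n_out,
        )
        for n_in, n_out in zip(dims, dims[1:])
    ]


rng = random.Random(0)
query_tower = init_layers([6, 8, 4], rng)  # psi: 6 query features -> R^4
item_tower = init_layers([3, 8, 4], rng)   # phi: 3 item features -> R^4


def score(x_query, x_item):
    """Dot product of the query embedding and the item embedding."""
    psi_q = tower(x_query, query_tower)
    phi_i = tower(x_item, item_tower)
    return sum(a * b for a, b in zip(psi_q, phi_i))


s = score([1.0, 0.0, 0.5, 0.0, 1.0, 0.0], [0.2, 1.0, 0.0])
print(s)
```

Because the item tower takes features rather than an item ID, `score` can be evaluated for an item that was never seen during training, as long as its features are available.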
