Embeddings: Interactive exercises

The following widget, based on TensorFlow's Embedding Projector, flattens 10,000 word2vec static vectors into a 3D space. This collapse of dimensions can be misleading, because the points closest to each other in the original high-dimensional space may appear farther apart in the 3D projection. The n points closest to the selected word are highlighted in purple, with n set by the user in the Isolate __ points field. The sidebar on the right lists those nearest neighbors.
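To see why a 3D projection can rearrange neighbors, here is a minimal sketch. It makes several assumptions that are not taken from the widget: the vectors are random synthetic data rather than real word2vec embeddings, the projection is PCA, and closeness is measured with Euclidean distance. The helper names nearest_neighbors and project_with_pca are made up for this illustration.

```python
import numpy as np


def nearest_neighbors(vectors, query_index, n):
    """Return indices of the n rows closest to vectors[query_index] (Euclidean)."""
    distances = np.linalg.norm(vectors - vectors[query_index], axis=1)
    distances[query_index] = np.inf  # never return the query point itself
    return np.argsort(distances)[:n]


def project_with_pca(vectors, dims=3):
    """Project the rows of `vectors` onto their top `dims` principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dims].T


# Synthetic stand-in for word vectors (assumption: 1,000 points, 200 dimensions).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 200))
projected = project_with_pca(vectors, dims=3)

# Compare the 20 nearest neighbors of point 0 before and after projection.
original = set(nearest_neighbors(vectors, 0, 20))
flattened = set(nearest_neighbors(projected, 0, 20))
overlap = len(original & flattened)
print(f"{overlap} of 20 neighbors are shared between the two spaces")
```

With unstructured random data like this, few of the original neighbors usually remain neighbors in 3D. Real embeddings have more structure, so the mismatch tends to be less extreme, but the same effect applies.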

In these experiments, you'll play with the word2vec embeddings in the widget above.

Task 1

Try to find the 20 nearest neighbors for the following, and see where the groups fall in the cloud.

  • iii, third, and three
  • tao and way
  • orange, yellow, and juice

What do you notice about these results?

Task 2

Try to figure out some characteristics of the training data. For example, try to find the 100 nearest neighbors for the following, and see where the groups are in the cloud:

  • boston, paris, tokyo, delhi, moscow, and seoul (this is a trick question)
  • jane, sarah, john, peter, rosa, and juan

Task 3

Embeddings aren't limited to words. Images, audio, and other data can also be embedded. For this task:

  1. Open TensorFlow's Embedding Projector.
  2. In the left sidebar titled Data, choose Mnist with images. This brings up a projection of the embeddings of the MNIST database of handwritten digits.
  3. Click to stop the rotation and choose a single image. Zoom in and out as needed.
  4. Look in the right sidebar for nearest neighbors. Are there any surprises?
  • Why do some 7s have 1s as their nearest neighbors? Why do some 8s have 9s as their nearest neighbors?
  • Is there anything about the images on the edges of the projection space that seems different from the images in the center of the projection space?

Keep in mind that the model that created these embeddings receives only image data (that is, pixels) and chooses a numerical vector representation for each image. The model doesn't automatically associate the image of a handwritten digit with the numerical digit itself.
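The following toy sketch illustrates the point. It is a deliberate simplification and not how the MNIST embeddings in the projector were produced: each "embedding" is just the flattened pixels of a hand-drawn 5x5 glyph, and neighbors are found by Euclidean distance. The glyphs and the helper names to_vector and nearest_label are invented for this example.

```python
import numpy as np

# Hypothetical 5x5 bitmaps; '#' marks an inked pixel.
GLYPHS = {
    "1": ["..#..", "..#..", "..#..", "..#..", "..#.."],
    "7": ["#####", "....#", "...#.", "..#..", "..#.."],
    "0": [".###.", "#...#", "#...#", "#...#", ".###."],
}


def to_vector(rows):
    """Flatten a glyph into a 25-element vector of 0.0 and 1.0 values."""
    return np.array([[ch == "#" for ch in row] for row in rows], dtype=float).ravel()


vectors = {label: to_vector(rows) for label, rows in GLYPHS.items()}


def nearest_label(label):
    """Return the label of the other glyph whose pixel vector is closest."""
    others = [other for other in vectors if other != label]
    return min(others, key=lambda other: np.linalg.norm(vectors[label] - vectors[other]))


print(nearest_label("7"))  # prints 1
```

The 7 lands next to the 1 because the two glyphs share pixels along a vertical stroke. The labels play no role in the distance calculation.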