The following widget, based on TensorFlow's
Embedding Projector, flattens 10,000
word2vec static vectors into a 3D space. This collapse of dimensions can be
misleading, because the points closest to each other in the original
high-dimensional space may appear farther apart in the 3D projection. The
closest n points are highlighted in purple, with n chosen by the user in
Isolate __ points. The sidebar on the right identifies those nearest
neighbors.
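The distortion described above can be checked numerically. The following is a minimal sketch, not the widget's actual code: it assumes a PCA projection (the text does not say which projection the widget uses), Euclidean distance, and random stand-in vectors rather than real word2vec vectors. The names `pca_project` and `nearest_indices` and the sizes 1000 and 200 are illustrative choices.

```python
import numpy as np

def pca_project(points, n_components=3):
    """Project points onto their top principal components using SVD."""
    centered = points - points.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def nearest_indices(points, query_index, n):
    """Return indices of the n points closest (Euclidean) to the query point."""
    distances = np.linalg.norm(points - points[query_index], axis=1)
    distances[query_index] = np.inf  # exclude the query point itself
    return np.argsort(distances)[:n]

rng = np.random.default_rng(0)
# Random stand-ins for embeddings; these are NOT real word2vec vectors.
vectors = rng.normal(size=(1000, 200))
projected = pca_project(vectors, n_components=3)

n = 20
high_dim = set(nearest_indices(vectors, 0, n).tolist())
low_dim = set(nearest_indices(projected, 0, n).tolist())
overlap = len(high_dim & low_dim)
print(f"{overlap} of the {n} nearest neighbors survive the projection")
```

With unstructured random data, few of the original neighbors remain neighbors in 3D. Real embeddings have more structure, so the effect is usually milder, but the same caution applies when reading the point cloud.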
In these experiments, you'll play with the word2vec embeddings in the widget
above.
Task 1
Try to find the 20 nearest neighbors for the following, and see where the
groups fall in the cloud.
iii, third, and three
tao and way
orange, yellow, and juice
What do you notice about these results?
Our answer:
Even though iii, third, and three
are semantically similar, they appear in different contexts in text and
don't appear to be close together in this embedding space. In
word2vec, iii is closer to iv than to
third.
Similarly, while way is a direct translation of tao,
these words most frequently occur with completely different groups of words
in the dataset used, and so the two vectors are very far apart.
The first several nearest neighbors of orange are colors, but
juice and peel, related to the meaning of
orange as fruit, show up as the 14th
and 18th nearest neighbors. prince, meanwhile, as in the
Prince of Orange, is 17th. In the projection, the words closest to
orange are yellow and other
colors, while the closest words to juice don't include
orange.
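To make "14th nearest neighbor" concrete, here is a small sketch of how a neighbor ranking can be computed. It assumes cosine similarity as the measure of closeness, and the three-dimensional vectors are made up for illustration; they are not real word2vec values, and `rank_neighbors` is a hypothetical helper name.

```python
import numpy as np

def rank_neighbors(word, embeddings):
    """Return the other words sorted from most to least cosine-similar to `word`."""
    words = [w for w in embeddings if w != word]
    query = embeddings[word] / np.linalg.norm(embeddings[word])
    similarities = []
    for w in words:
        v = embeddings[w]
        similarities.append(float(query @ v / np.linalg.norm(v)))
    order = np.argsort(similarities)[::-1]  # highest similarity first
    return [(words[i], similarities[i]) for i in order]

# Made-up 3-dimensional vectors for illustration only; real word2vec vectors
# have many more dimensions and different values.
toy_embeddings = {
    "orange": np.array([0.9, 0.3, 0.1]),
    "yellow": np.array([0.8, 0.4, 0.0]),
    "red":    np.array([0.9, 0.1, 0.2]),
    "juice":  np.array([0.2, 0.1, 0.9]),
    "peel":   np.array([0.3, 0.0, 0.8]),
}

ranked = rank_neighbors("orange", toy_embeddings)
for position, (neighbor, similarity) in enumerate(ranked, start=1):
    print(f"{position}. {neighbor} (cosine similarity {similarity:.2f})")
```

The toy vectors were chosen so that the color words rank ahead of juice and peel, mirroring the pattern seen in the widget. Note also that the ranking is not symmetric: a word can be among the nearest neighbors of orange without orange being among its own nearest neighbors.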
Task 2
Try to figure out some characteristics of the training data. For example, try
to find the 100 nearest neighbors for the following, and see where the groups
are in the cloud:
boston, paris, tokyo, delhi, moscow, and seoul (this is a trick
question)
jane, sarah, john, peter, rosa, and juan
Our answer:
Many of the nearest neighbors to boston are other cities in
the US. Many of the nearest neighbors to paris are other cities
in Europe. tokyo and delhi don't seem to have
similar results: one is associated with cities around the world that are
travel hubs, while the other is associated with india and related
words. seoul doesn't appear in this trimmed-down set of
word vectors at all.
It seems that this dataset contains many documents related to US national
geography, some documents related to European regional geography, and not
much fine-grained coverage of other countries or regions.
Similarly, this dataset seems to contain many male English names, some female
English names, and far fewer names from other languages. Note that Don Rosa
wrote and illustrated Scrooge McDuck comics for Disney, which is the likely
reason that scrooge and mcduck are among the nearest neighbors for rosa.
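The pre-trained word vectors offered by word2vec were in fact trained on
Google News articles up to 2013 (https://code.google.com/archive/p/word2vec/).
Task 3
Embeddings aren't limited to words. Images, audio, and other data can also be
embedded. For this task:
Open TensorFlow's Embedding Projector (https://projector.tensorflow.org/).
In the left sidebar titled Data, choose Mnist with images. This brings up a
projection of the embeddings of the MNIST database of handwritten digits.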
Click to stop the rotation and choose a single image. Zoom in and out as
needed.
Look in the right sidebar for nearest neighbors. Are there any surprises?
Why do some 7s have 1s as their nearest neighbor? Why do some 8s have
9s as their nearest neighbor?
Is there anything about the images on the edges of the projection space
that seems different from the images in the center of the projection space?
Keep in mind that the model that created these embeddings is receiving image
data, which is to say, pixels, and choosing a numerical vector representation
for each image. The model doesn't make an automatic mental association
between the image of the handwritten digit and the numerical digit itself.
Our answer:
Due to similarities in shape, the vector representations of some of the
skinnier, narrower 7s are placed closer to the vectors for
handwritten 1s. The same thing happens for some 8s
and 9s, and even some of the 5s and 3s.
The handwritten digits on the outside of the projection space appear
more clearly identifiable as one of the ten digits (0 through 9) and are
strongly differentiated from other possible digits.
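The shape-similarity effect can be illustrated with a toy sketch. It makes two loud assumptions: the 5x5 glyphs below are hand-drawn for this example and are not MNIST images, and raw pixel distance stands in for the learned embedding, whose model the text does not describe. The helper names `bitmap` and `nearest_glyph` are hypothetical.

```python
import numpy as np

def bitmap(rows):
    """Turn strings of '#' and '.' into a flat vector of 1.0 and 0.0 pixels."""
    return np.array(
        [[1.0 if ch == "#" else 0.0 for ch in row] for row in rows]
    ).ravel()

# Hand-drawn 5x5 glyphs invented for this example (not taken from MNIST).
glyphs = {
    "1":        bitmap(["..#..", "..#..", "..#..", "..#..", "..#.."]),
    "narrow 7": bitmap([".##..", "..#..", "..#..", "..#..", "..#.."]),
    "wide 7":   bitmap(["#####", "....#", "...#.", "..#..", ".#..."]),
    "0":        bitmap([".###.", "#...#", "#...#", "#...#", ".###."]),
}

def nearest_glyph(name):
    """Return the other glyph with the smallest Euclidean pixel distance."""
    others = [g for g in glyphs if g != name]
    return min(others, key=lambda g: np.linalg.norm(glyphs[g] - glyphs[name]))

print(nearest_glyph("narrow 7"))
print(nearest_glyph("wide 7"))
```

Here the narrow 7 differs from the 1 by a single pixel, so its nearest neighbor is the 1 rather than the wide 7, even though a person would label both 7s as the same digit. A learned embedding is more forgiving than raw pixels, but it is still driven by shape rather than by the digit's meaning.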
Key terms: Embedding vector, Embedding space
Last updated 2025-04-15 UTC.