Word Vectors and Semantic Similarity
Similarity is determined by comparing word vectors or “word embeddings”, multi-dimensional meaning representations of a word. Word vectors can be generated using an algorithm like word2vec and usually look like this:
banana.vector
array([2.02280000e-01, -7.66180009e-02, 3.70319992e-01, 3.28450017e-02, -4.19569999e-01, 7.20689967e-02, -3.74760002e-01, 5.74599989e-02, -1.24009997e-02, 5.29489994e-01, -5.23800015e-01, -1.97710007e-01, -3.41470003e-01, 5.33169985e-01, -2.53309999e-02, 1.73800007e-01, 1.67720005e-01, 8.39839995e-01, 5.51070012e-02, 1.05470002e-01, 3.78719985e-01, 2.42750004e-01, 1.47449998e-02, 5.59509993e-01, 1.25210002e-01, -6.75960004e-01, 3.58420014e-01, # ... and so on ... 3.66849989e-01, 2.52470002e-03, -6.40089989e-01, -2.97650009e-01, 7.89430022e-01, 3.31680000e-01, -1.19659996e+00, -4.71559986e-02, 5.31750023e-01], dtype=float32)
Models that come with built-in word vectors make them available as the Token.vector attribute. Doc.vector and Span.vector default to an average of their token vectors. You can also check whether a token has a vector assigned, and get its L2 norm, which can be used to normalize vectors.
import spacy
nlp = spacy.load('en_core_web_md')
tokens = nlp(u'dog cat banana afskfsd')
for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)
The words “dog”, “cat” and “banana” are all pretty common in English, so they’re part of the model’s vocabulary, and come with a vector. The word “afskfsd”, on the other hand, is a lot less common and out-of-vocabulary, so its vector representation consists of 300 dimensions of 0, which means it’s practically nonexistent. If your application will benefit from a large vocabulary with more vectors, you should consider using one of the larger models or loading in a full vector package, for example en_vectors_web_lg, which includes over 1 million unique vectors.
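To make the averaging and the L2 norm concrete, here is a minimal pure-Python sketch on toy 3-dimensional vectors. The helper names (`l2_norm`, `average`) and the vector values are made up for illustration, and treating "norm greater than zero" as a stand-in for having a vector is an assumption of this sketch, not how spaCy decides it.

```python
import math

def l2_norm(vector):
    """Return the Euclidean (L2) length of a vector."""
    return math.sqrt(sum(x * x for x in vector))

def average(vectors):
    """Element-wise mean of a list of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Toy 3-dimensional vectors (hypothetical values; real models use e.g. 300 dimensions).
token_vectors = {
    "dog": [3.0, 4.0, 0.0],
    "cat": [1.0, 2.0, 2.0],
    "afskfsd": [0.0, 0.0, 0.0],  # out-of-vocabulary: all zeros
}

# A document-level vector as the plain average of its token vectors.
doc_vector = average(list(token_vectors.values()))

# Per-token norms; an all-zero vector has norm 0.
norms = {word: l2_norm(vec) for word, vec in token_vectors.items()}

# Assumption for this sketch only: a zero norm means "no usable vector".
has_vector = {word: norm > 0 for word, norm in norms.items()}
```

Note that the all-zero vector still takes part in the average here, so an out-of-vocabulary token pulls the averaged vector towards zero.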
spaCy is able to compare two objects, and make a prediction of how similar they are. Predicting similarity is useful for building recommendation systems or flagging duplicates. For example, you can suggest a user content that’s similar to what they’re currently looking at, or label a support ticket as a duplicate if it’s very similar to an already existing one.
Each Doc, Span and Token comes with a .similarity() method that lets you compare it with another object, and determine the similarity. Of course similarity is always subjective: whether “dog” and “cat” are similar really depends on how you’re looking at it. spaCy’s similarity model usually assumes a pretty general-purpose definition of similarity.
import spacy
nlp = spacy.load('en_core_web_md') # make sure to use larger model!
tokens = nlp(u'dog cat banana')
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))
In this case, the model’s predictions are pretty on point. A dog is very similar to a cat, whereas a banana is not very similar to either of them. Identical tokens are obviously 100% similar to each other (just not always exactly 1.0, because of vector math and floating point imprecisions).
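By default the score is a cosine similarity between the two vectors. The following is a small pure-Python sketch of that computation. The three toy vectors are invented for illustration and do not come from any model, and returning 0.0 for an all-zero vector is a choice made for this sketch.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors, chosen so that "dog" and "cat"
# point in similar directions and "banana" does not.
dog = [0.8, 0.3, 0.1]
cat = [0.7, 0.4, 0.1]
banana = [0.1, 0.2, 0.9]

dog_cat = cosine_similarity(dog, cat)
dog_banana = cosine_similarity(dog, banana)
dog_dog = cosine_similarity(dog, dog)
```

Because the result is a ratio of floating point sums and square roots, comparing a vector with itself is best checked against 1.0 with a small tolerance rather than with strict equality.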
Customizing word vectors
Word vectors let you import knowledge from raw text into your model. The knowledge is represented as a table of numbers, with one row per term in your vocabulary. If two terms are used in similar contexts, the algorithm that learns the vectors should assign them rows that are quite similar, while words that are used in different contexts will have quite different values. This lets you use the row-values assigned to the words as a kind of dictionary, to tell you some things about what the words in your text mean.
Word vectors are particularly useful for terms which aren’t well represented in your labelled training data. For instance, if you’re doing named entity recognition, there will always be lots of names that you don’t have examples of. Imagine your training data happens to contain some examples of the term “Microsoft”, but it doesn’t contain any examples of the term “Symantec”. In your raw text sample, there are plenty of examples of both terms, and they’re used in similar contexts. The word vectors make that fact available to the entity recognition model. It still won’t see examples of “Symantec” labelled as a company. However, it’ll see that “Symantec” has a word vector that usually corresponds to company terms, so it can make the inference.
In order to make best use of the word vectors, you want the word vectors table to cover a very large vocabulary. However, most words are rare, so most of the rows in a large word vectors table will be accessed very rarely, or never at all. You can usually cover more than 95% of the tokens in your corpus with just a few thousand rows in the vector table. However, it’s those 5% of rare terms where the word vectors are most useful. The problem is that increasing the size of the vector table produces rapidly diminishing returns in coverage over these rare terms.
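The coverage trade-off can be measured directly on a corpus. The sketch below is a toy illustration: `token_coverage` is a hypothetical helper and the corpus is a made-up sentence. It computes the fraction of tokens covered when only the most frequent word types get a row.

```python
from collections import Counter

def token_coverage(tokens, n_rows):
    """Fraction of tokens covered if only the n_rows most frequent types get a vector."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(n_rows))
    return covered / len(tokens)

# Made-up corpus: 19 tokens, 10 distinct word types.
corpus = (
    "the cat sat on the mat and the dog sat on the rug "
    "while the cat saw the dog"
).split()

coverage_by_rows = {n: token_coverage(corpus, n) for n in (1, 5, 10)}
```

In this toy corpus the first five rows already cover most tokens, while each of the last five rows adds only a single token of coverage, which is the same diminishing-returns pattern at a much smaller scale.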
Converting word vectors for use in spaCy v2.0.10
Custom word vectors can be trained using a number of open-source libraries, such as Gensim, FastText, or Tomas Mikolov’s original word2vec implementation. Most word vector libraries output an easy-to-read text-based format, where each line consists of the word followed by its vector. For everyday use, we want to convert the vectors model into a binary format that loads faster and takes up less space on disk. The easiest way to do this is the init-model command-line utility:
wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.vec.gz
python -m spacy init-model en /tmp/la_vectors_wiki_lg --vectors-loc cc.la.300.vec.gz
This will output a spaCy model in the directory /tmp/la_vectors_wiki_lg, giving you access to some nice Latin vectors 😉 You can then pass the directory path to spacy.load().
import spacy
nlp_latin = spacy.load("/tmp/la_vectors_wiki_lg")
doc1 = nlp_latin(u"Caecilius est in horto")
doc2 = nlp_latin(u"servus est in atrio")
doc1.similarity(doc2)
The model directory will have a /vocab directory with the strings, lexical entries and word vectors from the input vectors model. The init-model command supports a number of archive formats for the word vectors: the vectors can be in plain text (.txt), zipped (.zip), or tarred and zipped (.tgz).
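For reference, the text-based format mentioned earlier (one word per line, followed by its vector) is simple enough to read by hand. The sketch below is a hypothetical reader, not part of spaCy: `read_text_vectors` and the sample data are made up, and it assumes space-separated values with an optional `<rows> <dims>` header line as written by word2vec-style tools.

```python
import io

def read_text_vectors(lines):
    """Parse 'word v1 v2 ...' lines into a dict mapping word -> list of floats."""
    vectors = {}
    expected_dim = None
    for line_number, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        # Optional header on the first line: "<number of rows> <dimensions>".
        if line_number == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            expected_dim = int(parts[1])
            continue
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word = parts[0]
        values = [float(v) for v in parts[1:]]
        if expected_dim is not None and len(values) != expected_dim:
            raise ValueError("line %d: expected %d values, got %d"
                             % (line_number + 1, expected_dim, len(values)))
        vectors[word] = values
    return vectors

# Made-up sample with a header line and two 3-dimensional vectors.
sample = io.StringIO("2 3\ncanis 0.1 0.2 0.3\nfeles 0.2 0.1 0.4\n")
vectors = read_text_vectors(sample)
```

Any iterable of text lines works as input, so the same function could be pointed at an opened file instead of the in-memory sample.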
Optimizing vector coverage v2.0
To help you strike a good balance between coverage and memory usage, spaCy’s Vectors class lets you map multiple keys to the same row of the table. If you’re using the spacy init-model command to create a vocabulary, pruning the vectors will be taken care of automatically if you set the --prune-vectors flag. You can also do it manually in the following steps:
- Start with a word vectors model that covers a huge vocabulary. For instance, the en_vectors_web_lg model provides 300-dimensional GloVe vectors for over 1 million terms of English.
- If your vocabulary has values set for the Lexeme.prob attribute, the lexemes will be sorted by descending probability to determine which vectors to prune. Otherwise, lexemes will be sorted by their order in the Vocab.
- Call Vocab.prune_vectors with the number of vectors you want to keep.
import spacy
nlp = spacy.load('en_vectors_web_lg')
n_vectors = 105000 # number of vectors to keep
removed_words = nlp.vocab.prune_vectors(n_vectors)
assert len(nlp.vocab.vectors) <= n_vectors # unique vectors have been pruned
assert nlp.vocab.vectors.n_keys > n_vectors # but not the total entries
Vocab.prune_vectors reduces the current vector table to a given number of unique entries, and returns a dictionary containing the removed words, mapped to (string, score) tuples, where string is the entry the removed word was mapped to, and score the similarity score between the two words.
Removed words
{
    "Shore": ("coast", 0.732257),
    "Precautionary": ("caution", 0.490973),
    "hopelessness": ("sadness", 0.742366),
    "Continous": ("continuous", 0.732549),
    "Disemboweled": ("corpse", 0.499432),
    "biostatistician": ("scientist", 0.339724),
    "somewheres": ("somewheres", 0.402736),
    "observing": ("observe", 0.823096),
    "Leaving": ("leaving", 1.0),
}
In the example above, the vector for “Shore” was removed and remapped to the vector of “coast”, which is deemed about 73% similar. “Leaving” was remapped to the vector of “leaving”, which is identical.
If you’re using the init-model command, you can set the --prune-vectors option to easily reduce the size of the vectors as you add them to a spaCy model:
python -m spacy init-model en /tmp/la_vectors_web_md --vectors-loc la.300d.vec.tgz --prune-vectors 10000
This will create a spaCy model with vectors for the first 10,000 words in the vectors model. All other words in the vectors model are mapped to the closest vector among those retained.
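The remapping idea can be sketched in a few lines of pure Python. This is a toy illustration of the concept, not spaCy’s implementation: the `prune` helper, the 2-dimensional vectors and the use of plain cosine similarity as the score are all assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def prune(table, n_keep):
    """Keep the first n_keep rows; remap every other word to its closest kept word.

    table is a list of (word, vector) pairs, already ordered by priority.
    Returns (kept, remap), where remap maps removed word -> (kept word, score).
    """
    if n_keep < 1:
        raise ValueError("n_keep must be at least 1")
    kept = table[:n_keep]
    remap = {}
    for word, vector in table[n_keep:]:
        best_word, best_score = max(
            ((kept_word, cosine(vector, kept_vec)) for kept_word, kept_vec in kept),
            key=lambda pair: pair[1],
        )
        remap[word] = (best_word, best_score)
    return dict(kept), remap

# Hypothetical 2-dimensional table, most important words first.
table = [
    ("coast", [0.9, 0.1]),
    ("sadness", [0.1, 0.9]),
    ("Shore", [0.8, 0.3]),
    ("hopelessness", [0.2, 0.7]),
]
kept, removed_words = prune(table, n_keep=2)
```

The returned `removed_words` dictionary has the same shape as the example output shown earlier: each removed word points to the retained entry it now shares a row with, together with a similarity score.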
Adding vectors v2.0
spaCy’s new Vectors class greatly improves the way word vectors are stored, accessed and used. The data is stored in two structures:
- An array, which can be either on CPU or GPU.
- A dictionary mapping string-hashes to rows in the table.
Keep in mind that the Vectors class itself has no StringStore, so you have to store the hash-to-string mapping separately. If you need to manage the strings, you should use the Vectors via the Vocab class, e.g. vocab.vectors. To add vectors to the vocabulary, you can use the Vocab.set_vector method.
Adding vectors
import numpy
from spacy.vocab import Vocab

# Random 300-dimensional vectors, used here as placeholder data
vector_data = {
    u"dog": numpy.random.uniform(-1, 1, (300,)),
    u"cat": numpy.random.uniform(-1, 1, (300,)),
    u"orange": numpy.random.uniform(-1, 1, (300,)),
}
vocab = Vocab()
for word, vector in vector_data.items():
    vocab.set_vector(word, vector)
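The two-part storage described in this section (a table of rows plus a mapping from string hashes to row indices) can be modelled with a small toy class. Everything in the sketch below is hypothetical: `ToyVectors` is not spaCy’s Vectors class, and `toy_hash` uses CRC32 purely as a stand-in for the string hashing spaCy performs.

```python
import zlib

def toy_hash(string):
    """Stand-in string hash for this sketch (spaCy uses its own hashing)."""
    return zlib.crc32(string.encode("utf8"))

class ToyVectors:
    """A table of rows plus a mapping from key hashes to row indices."""

    def __init__(self, n_dims):
        self.n_dims = n_dims
        self.data = []      # the "array": one list of floats per row
        self.key2row = {}   # the "dictionary": key hash -> row index

    def add(self, key, vector=None, row=None):
        """Store a new row for key, or point key at an existing row."""
        if row is None:
            if vector is None or len(vector) != self.n_dims:
                raise ValueError("expected a vector with %d values" % self.n_dims)
            self.data.append(list(vector))
            row = len(self.data) - 1
        elif not 0 <= row < len(self.data):
            raise IndexError("row %d does not exist" % row)
        self.key2row[key] = row
        return row

    def __getitem__(self, key):
        return self.data[self.key2row[key]]

vectors = ToyVectors(n_dims=2)
dog_row = vectors.add(toy_hash("dog"), [1.0, 0.0])
vectors.add(toy_hash("cat"), [0.0, 1.0])
vectors.add(toy_hash("puppy"), row=dog_row)  # second key, same row as "dog"

# The table only knows hashes, so the strings are tracked separately.
strings = {toy_hash(word): word for word in ("dog", "cat", "puppy")}
```

Here three keys share two rows, which is the same mechanism that lets a pruned table keep more keys than unique vectors.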
Loading GloVe vectors v2.0
spaCy comes with built-in support for loading GloVe vectors from a directory. The Vectors.from_glove method assumes a binary format, the vocab provided in a vocab.txt, and the naming scheme of vectors.{size}.[fd].bin. For example:
Directory structure
└── vectors
    ├── vectors.128.f.bin  # vectors file
    └── vocab.txt          # vocabulary
File name | Dimensions | Data type |
---|---|---|
vectors.128.f.bin | 128 | float32 |
vectors.300.d.bin | 300 | float64 (double) |
import spacy
nlp = spacy.load("en_core_web_sm")
nlp.vocab.vectors.from_glove("/path/to/vectors")
If your instance of Language already contains vectors, they will be overwritten. To create your own GloVe vectors model package like spaCy’s en_vectors_web_lg, you can call nlp.to_disk, and then package the model using the package command.
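The vectors.{size}.[fd].bin naming scheme described above encodes both the width and the data type of the table. As a rough sketch, a file name can be decoded like this. `parse_vectors_filename` is a hypothetical helper written for illustration and is not part of spaCy.

```python
import re

# "f" and "d" as listed in the table above.
DTYPE_CODES = {"f": "float32", "d": "float64"}
FILENAME_PATTERN = re.compile(r"^vectors\.(\d+)\.([fd])\.bin$")

def parse_vectors_filename(filename):
    """Return (dimensions, dtype name) for a vectors.{size}.[fd].bin file name."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError("unexpected vectors file name: %r" % filename)
    return int(match.group(1)), DTYPE_CODES[match.group(2)]

small = parse_vectors_filename("vectors.128.f.bin")
large = parse_vectors_filename("vectors.300.d.bin")
```

A check like this can be useful before loading, to confirm that a directory follows the expected layout.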
Using custom similarity methods
By default, Token.vector
returns the vector for its
underlying Lexeme
, while Doc.vector
and
Span.vector
return an average of the vectors of their
tokens. You can customize these behaviors by modifying the doc.user_hooks
,
doc.user_span_hooks
and doc.user_token_hooks
dictionaries.
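As a minimal sketch of such a hook, the example below replaces the similarity of a single Doc with a simple word-overlap score. It assumes spaCy is installed. `token_overlap` is a hypothetical function written for this example, and a blank English pipeline is used so that no pretrained model or vectors are needed.

```python
import spacy

def token_overlap(obj1, obj2):
    """Hypothetical similarity: Jaccard overlap of lower-cased token texts."""
    words1 = {token.lower_ for token in obj1}
    words2 = {token.lower_ for token in obj2}
    if not words1 or not words2:
        return 0.0
    return len(words1 & words2) / float(len(words1 | words2))

nlp = spacy.blank("en")  # tokenizer only, no word vectors involved
doc1 = nlp(u"dog cat")
doc2 = nlp(u"cat banana")

# Register the custom function under the "similarity" key of this Doc.
doc1.user_hooks["similarity"] = token_overlap
score = doc1.similarity(doc2)
```

The hook is set on one Doc here. To apply it to every document, the same assignment could be done inside a custom pipeline component, with doc.user_span_hooks and doc.user_token_hooks covering Span and Token comparisons.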
Storing vectors on a GPU
If you’re using a GPU, it’s much more efficient to keep the word vectors on the
device. You can do that by setting the Vectors.data
attribute to a cupy.ndarray
object if you’re using spaCy or
Chainer, or a torch.Tensor
object if you’re using
PyTorch. The data
object just needs to support
__iter__
and __getitem__
, so if you’re using another library such as
TensorFlow, you could also create a wrapper for
your vectors data.
spaCy, Thinc or Chainer
import cupy.cuda
import numpy
from spacy.vectors import Vectors

vector_table = numpy.zeros((3, 300), dtype="f")
vectors = Vectors([u"dog", u"cat", u"orange"], vector_table)
# Copy the table to the first GPU
with cupy.cuda.Device(0):
    vectors.data = cupy.asarray(vectors.data)
PyTorch
import numpy
import torch
from spacy.vectors import Vectors

vector_table = numpy.zeros((3, 300), dtype="f")
vectors = Vectors([u"dog", u"cat", u"orange"], vector_table)
# Move the table to the first GPU as a torch tensor
vectors.data = torch.Tensor(vectors.data).cuda(0)
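For a library that is not covered by the two examples above, the wrapper mentioned earlier only has to expose __iter__ and __getitem__. The sketch below shows the general shape of such a wrapper. `RowWrapper`, `fetch_row` and the dictionary used as a backend are all hypothetical stand-ins, and whether a given spaCy version needs anything beyond these two methods has not been verified here.

```python
class RowWrapper:
    """Expose a foreign 2D container through __iter__ and __getitem__."""

    def __init__(self, backend, fetch_row, n_rows):
        self._backend = backend
        self._fetch_row = fetch_row  # callable(backend, index) -> one row
        self._n_rows = n_rows

    def __getitem__(self, index):
        if not 0 <= index < self._n_rows:
            raise IndexError(index)
        return self._fetch_row(self._backend, index)

    def __iter__(self):
        for index in range(self._n_rows):
            yield self[index]

# Stand-in backend: in practice this would be a device-side tensor.
fake_device_table = {0: [0.1, 0.2], 1: [0.3, 0.4], 2: [0.5, 0.6]}
wrapped = RowWrapper(
    fake_device_table,
    fetch_row=lambda table, i: table[i],
    n_rows=len(fake_device_table),
)
rows = list(wrapped)
```

Passing the row lookup in as a function keeps the wrapper independent of any particular tensor library.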