Word2Vec & GloVe

Word2Vec Assumption

The probability of observing a word in the context of another word depends only on the relationship between their word vectors.

This means: if you’re trying to predict a word given its context (or vice versa), you do so based only on how similar their vector representations are.

Mathematical Expression

$$P(w_o \mid w_c) = \frac{\exp(u_o^\top v_c)}{\sum_{i=1}^{|V|} \exp(u_i^\top v_c)}$$

Definitions:

  • $w_c$: The context word (known/input).

  • $w_o$: The target word (the one we want to predict).

  • $u_i$: The output (target) word vector for word $i$.

  • $v_c$: The input (context) word vector for word $w_c$.

  • $u_o^\top v_c$: The dot product of the target and context word vectors.

  • $\sum_{i=1}^{|V|} \exp(u_i^\top v_c)$: Normalization term over all words in the vocabulary to make it a valid probability.

What does the formula represent?

This is a softmax function:

  • The numerator gives a score (via the exponential of the dot product) for how compatible $w_o$ and $w_c$ are.

  • The denominator normalizes this score over the entire vocabulary.

So this gives us the probability of the target word $w_o$ appearing in the context of $w_c$.
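
As a concrete illustration, here is a minimal NumPy sketch of this probability. The matrix names `V_in` (input/context vectors) and `U_out` (output/target vectors) are assumptions for the example, not part of any particular library.

```python
import numpy as np

def softmax_prob(center_idx, target_idx, V_in, U_out):
    """P(w_o | w_c): softmax over dot products with every output vector."""
    v_c = V_in[center_idx]            # input vector of the context word w_c
    scores = U_out @ v_c              # u_i . v_c for every word i in the vocabulary
    scores -= scores.max()            # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[target_idx] / exp_scores.sum()

# Toy usage: random embeddings for a 10-word vocabulary, 5-dimensional vectors.
rng = np.random.default_rng(0)
V_in = rng.normal(size=(10, 5))
U_out = rng.normal(size=(10, 5))
print(softmax_prob(center_idx=3, target_idx=7, V_in=V_in, U_out=U_out))
```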

How to train?

Likelihood (over the whole corpus, with context window size $m$):

$$L(\theta) = \prod_{t=1}^{T} \prod_{-m \le j \le m,\, j \ne 0} P(w_{t+j} \mid w_t; \theta)$$

We then minimize the negative log-likelihood:

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\, j \ne 0} \log P(w_{t+j} \mid w_t; \theta)$$

Update Rule (stochastic gradient descent):

$$\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta)$$

where:

  • $\theta$ = model parameters (the word vectors)
  • $\eta$ = learning rate

Limitation 1: Computing the softmax requires a dot product with all $|V|$ words in the vocabulary for every prediction.

Limitation 2: Gradient descent requires iterating over the whole corpus for each update, which is slow.
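
To make Limitation 1 concrete, the sketch below performs one SGD step on the negative log-likelihood of a single (context, target) pair, reusing the assumed `V_in` / `U_out` matrices from the earlier sketch. Note that the gradient touches every row of `U_out`.

```python
import numpy as np

def sgd_step_full_softmax(center_idx, target_idx, V_in, U_out, lr=0.05):
    """One SGD step on -log P(w_o | w_c) using the full softmax."""
    v_c = V_in[center_idx]
    scores = U_out @ v_c
    scores -= scores.max()
    probs = np.exp(scores) / np.exp(scores).sum()   # P(i | w_c) for every word i

    # Gradient of the negative log-likelihood w.r.t. the scores: P(i | c) - 1[i == o]
    d_scores = probs.copy()
    d_scores[target_idx] -= 1.0

    grad_v = U_out.T @ d_scores                     # gradient for the context vector
    U_out -= lr * np.outer(d_scores, v_c)           # updates all |V| output vectors
    V_in[center_idx] -= lr * grad_v                 # updates only one input vector
```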

Negative Sampling

Recall the probability:

$$P(w_o \mid w_c) = \frac{\exp(u_o^\top v_c)}{\sum_{i=1}^{|V|} \exp(u_i^\top v_c)}$$

In order to maximize the probability,

  • Maximize the numerator: $\exp(u_o^\top v_c)$
  • Minimize the denominator: $\sum_{i=1}^{|V|} \exp(u_i^\top v_c)$

but doing so by computing over the whole vocabulary is costly.

So instead of computing the full softmax denominator exactly, we approximate it using a small number of sampled negative terms:

  • Keep the positive pair $(w_c, w_o)$

  • Sample a few negative examples $w_1', \ldots, w_k'$ from a noise distribution (as sketched below)

  • Use a binary classification loss to push the model to distinguish real vs. fake context words
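
A small sketch of the sampling step, assuming a toy vocabulary with made-up counts; the 3/4-power unigram noise distribution is the one described later in the "Where:" list.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy unigram counts; in practice these come from the training corpus.
counts = np.array([120.0, 80.0, 45.0, 30.0, 10.0, 5.0])

# Noise distribution P_n(w): unigram counts raised to the 3/4 power, renormalized.
noise_probs = counts ** 0.75
noise_probs /= noise_probs.sum()

def sample_negatives(k, positive_idx):
    """Draw k negative word indices from P_n(w), rejecting the true context word."""
    negatives = []
    while len(negatives) < k:
        idx = int(rng.choice(len(noise_probs), p=noise_probs))
        if idx != positive_idx:
            negatives.append(idx)
    return negatives

print(sample_negatives(k=3, positive_idx=1))  # three indices drawn from P_n, none equal to 1
```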

From Softmax to Negative Sampling

Recall the original softmax:

$$P(w_o \mid w_c) = \frac{\exp(u_o^\top v_c)}{\sum_{i=1}^{|V|} \exp(u_i^\top v_c)}$$

This is expensive because the denominator sums over all $|V|$ words in the vocabulary.

Negative Sampling Objective

Instead of computing this full probability, we reframe the task as a binary classification problem:

Given a pair $(w_c, w)$, is it a real context pair (label 1) or a negative (label 0)?

We maximize the likelihood of labeling:

  • Positive pairs (true context pairs) as 1

  • Negative pairs (randomly sampled words) as 0

So the new loss for one pair $(w_c, w_o)$ becomes:

$$J_{\text{neg}} = -\log \sigma(u_o^\top v_c) - \sum_{i=1}^{k} \log \sigma(-u_{w_i'}^\top v_c)$$

Where:

  • $k$ = number of negative samples

  • $\sigma$ = the sigmoid function, $\sigma(x) = \frac{1}{1 + e^{-x}}$

  • $P_n(w)$ = noise distribution (often the unigram distribution raised to the 3/4 power), from which each $w_i'$ is drawn

  • $w_i'$ = a negative sample (not a real context word for $w_c$)
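
As a rough sketch, this loss can be written directly from the formula above, reusing the assumed `V_in` / `U_out` matrices and the `sample_negatives` helper from the earlier sketches.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center_idx, target_idx, negative_idxs, V_in, U_out):
    """-log sigma(u_o . v_c) - sum_i log sigma(-u_{w'_i} . v_c) for one training pair."""
    v_c = V_in[center_idx]
    pos_score = U_out[target_idx] @ v_c        # real pair: want this dot product large
    neg_scores = U_out[negative_idxs] @ v_c    # sampled fakes: want these small
    return -np.log(sigmoid(pos_score)) - np.log(sigmoid(-neg_scores)).sum()
```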

Intuition

  • Positive term: pushes $\sigma(u_o^\top v_c)$ higher, i.e., makes the dot product large (vectors more similar)

  • Negative terms: push $\sigma(u_{w_i'}^\top v_c)$ lower, i.e., make the dot products small (vectors more dissimilar)

So you’re teaching the model:

“Real pairs should align; fake pairs should diverge.”

Final Training Loss (Skip-Gram with Negative Sampling)

Instead of summing log-probabilities from the full softmax, we use this modified loss for every training pair $(w_t, w_{t+j})$:

$$J(w_t, w_{t+j}) = -\log \sigma(u_{w_{t+j}}^\top v_{w_t}) - \sum_{i=1}^{k} \log \sigma(-u_{w_i'}^\top v_{w_t})$$

We minimize this using stochastic gradient descent, just like before — but now:

  • We only update the vectors for the target, the context, and the few sampled negatives.

  • This makes training much, much faster.
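
Putting the pieces together, here is a hedged sketch of one skip-gram-with-negative-sampling update, assuming the same `V_in` / `U_out`, `sigmoid`, and `sample_negatives` names from the earlier sketches. Only the context row of `V_in` and k + 1 rows of `U_out` are touched.

```python
import numpy as np

def sgns_step(center_idx, target_idx, V_in, U_out, k=5, lr=0.025):
    """One SGD step of skip-gram with negative sampling for the pair (w_t, w_{t+j})."""
    negative_idxs = sample_negatives(k, positive_idx=target_idx)
    v_c = V_in[center_idx]
    u_o = U_out[target_idx]
    u_neg = U_out[negative_idxs]            # copy of the k negative rows, shape (k, dim)

    # Gradients of the loss w.r.t. the dot-product scores.
    g_pos = sigmoid(u_o @ v_c) - 1.0        # real pair: pushes its score up
    g_neg = sigmoid(u_neg @ v_c)            # fake pairs: pushes their scores down

    grad_v = g_pos * u_o + g_neg @ u_neg    # contributions from all k + 1 pairs

    U_out[target_idx] -= lr * g_pos * v_c
    np.subtract.at(U_out, negative_idxs, lr * np.outer(g_neg, v_c))  # handles repeated samples
    V_in[center_idx] -= lr * grad_v
```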

Why It Works So Well

  • Scales easily: Instead of updating all vocab vectors, we only update a few (target + k negatives)

  • Good representations: Despite the approximation, it produces high-quality embeddings

  • Empirically effective: Works especially well on large corpora with millions/billions of words