Word2Vec Assumption
The probability of observing a word in the context of another word depends only on the relationship between their word vectors.
This means: if you’re trying to predict a word given its context (or vice versa), you do so based only on how similar their vector representations are.
Mathematical Expression

$$P(w_o \mid w_c) = \frac{\exp(u_o^\top v_c)}{\sum_{i=1}^{V} \exp(u_i^\top v_c)}$$

Definitions:
- $w_c$: the context word (known/input).
- $w_o$: the target word (the one we want to predict).
- $u_i$: the output (target) word vector for word $i$ (so $u_o$ is the output vector of $w_o$).
- $v_c$: the input (context) word vector for word $w_c$.
- $u_o^\top v_c$: the dot product of the target and context word vectors.
- $\sum_{i=1}^{V} \exp(u_i^\top v_c)$: normalization term over all words in the vocabulary to make it a valid probability.
What does the formula represent?
This is a softmax function:
- The numerator $\exp(u_o^\top v_c)$ gives a score (via the exponential of the dot product) for how compatible $w_o$ and $w_c$ are.
- The denominator normalizes this score over the entire vocabulary.

So this gives us the probability of the target word $w_o$ appearing in the context of $w_c$.
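To make this concrete, here is a minimal NumPy sketch of the softmax probability; the toy vocabulary size, embedding dimension, and the matrices `U` (output vectors) and `Vin` (input vectors) are assumptions made up for illustration, not part of the original formulation.

```python
import numpy as np

# Toy setup (assumed for illustration): vocabulary of 5 words, 3-dim embeddings.
V, d = 5, 3
rng = np.random.default_rng(0)
U = rng.normal(size=(V, d))    # output (target) vectors u_i, one row per word
Vin = rng.normal(size=(V, d))  # input (context) vectors v_c, one row per word

def p_target_given_context(o, c):
    """P(w_o | w_c): softmax over the dot products of v_c with every u_i."""
    scores = U @ Vin[c]         # u_i . v_c for every word i, shape (V,)
    scores -= scores.max()      # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()

print(p_target_given_context(o=2, c=0))  # probability of word 2 given context word 0
```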
How to train?
Likelihood:

$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta)$$

Then minimize the negative log-likelihood:

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)$$

Update rule:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta)$$

where:
- $\theta$ = model parameters (word vectors)
- $\eta$ = learning rate
Limitation 1: we need to compute the dot product with all $V$ words in the vocabulary every time we evaluate the softmax.
Limitation 2: gradient descent requires iterating over the whole corpus for every update, which is very slow.
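To see where the cost comes from, here is a hedged sketch of one SGD step on the full-softmax loss for a single (context, target) pair; the toy setup (`U`, `Vin`, learning rate) is assumed for illustration. Note that the gradient touches every row of `U`.

```python
import numpy as np

# Toy setup (assumed for illustration).
V, d, lr = 5, 3, 0.05
rng = np.random.default_rng(0)
U = rng.normal(size=(V, d))    # output vectors u_i
Vin = rng.normal(size=(V, d))  # input vectors v_c

def sgd_step_full_softmax(o, c):
    """One SGD step on J = -log P(w_o | w_c). Every row of U is touched."""
    scores = U @ Vin[c]
    p = np.exp(scores - scores.max())
    p /= p.sum()                  # softmax over the whole vocabulary: O(V * d)
    p[o] -= 1.0                   # p - one_hot(o), the usual softmax gradient
    grad_U = np.outer(p, Vin[c])  # gradient w.r.t. ALL V output vectors
    grad_vc = U.T @ p             # gradient w.r.t. the single context vector
    U[:] -= lr * grad_U           # O(V * d) update per training pair (Limitation 1)
    Vin[c] -= lr * grad_vc

sgd_step_full_softmax(o=2, c=0)
```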
Negative Sampling
Recall the probability:

$$P(w_o \mid w_c) = \frac{\exp(u_o^\top v_c)}{\sum_{i=1}^{V} \exp(u_i^\top v_c)}$$

In order to maximize this probability, we want to:
- Maximize the numerator: $\exp(u_o^\top v_c)$
- Minimize the denominator: $\sum_{i=1}^{V} \exp(u_i^\top v_c)$

but computing the denominator over the whole vocabulary is costly. So instead of minimizing the full softmax denominator, we approximate it using a small number of sampled negative terms:
- Keep the positive pair $(w_c, w_o)$
- Sample a few negative examples $w_1', \ldots, w_k'$ from a noise distribution
- Use a binary classification loss to push the model to distinguish real vs. fake context words (see the sketch after this list)
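As a sketch of the sampling step, assuming a made-up table of word counts, negatives can be drawn from the unigram distribution raised to the 3/4 power (the common choice for the noise distribution mentioned below):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus word counts (assumed for illustration): index = word id.
counts = np.array([50.0, 30.0, 10.0, 5.0, 5.0])

# Noise distribution P_n(w): unigram counts raised to the 3/4 power, renormalized.
noise = counts ** 0.75
noise /= noise.sum()

def sample_negatives(true_target, k=3):
    """Draw k negative word ids from P_n(w), skipping the true target word."""
    negatives = []
    while len(negatives) < k:
        w = int(rng.choice(len(counts), p=noise))
        if w != true_target:
            negatives.append(w)
    return negatives

print(sample_negatives(true_target=2, k=3))
```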
From Softmax to Negative Sampling
Recall the original softmax:

$$P(w_o \mid w_c) = \frac{\exp(u_o^\top v_c)}{\sum_{i=1}^{V} \exp(u_i^\top v_c)}$$

This is expensive because the denominator sums over all $V$ words in the vocabulary.
Negative Sampling Objective
Instead of computing this full probability, we reframe the task as a binary classification problem:
Given a pair $(w_c, w)$, is it a real context pair (label 1) or a negative (label 0)?
We maximize the likelihood of labeling:
- Positive pairs (true context pairs) as 1
- Negative pairs (randomly sampled words) as 0
So the new loss for one pair $(w_c, w_o)$ becomes:

$$J = -\log \sigma(u_o^\top v_c) \;-\; \sum_{i=1}^{k} \mathbb{E}_{w_i' \sim P_n(w)} \left[ \log \sigma(-u_{w_i'}^\top v_c) \right]$$
Where:
- $k$ = number of negative samples
- $P_n(w)$ = noise distribution (often the unigram distribution raised to the 3/4 power)
- $w_i'$ = a negative sample (not a real context word for $w_c$)
- $\sigma$ = the logistic sigmoid, $\sigma(x) = 1 / (1 + e^{-x})$
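A minimal NumPy sketch of this per-pair loss; the toy vectors and the `sgns_loss` helper are assumptions made up for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy setup (assumed for illustration): 5-word vocab, 3-dim embeddings.
V, d = 5, 3
rng = np.random.default_rng(0)
U = rng.normal(size=(V, d))    # output vectors u_i
Vin = rng.normal(size=(V, d))  # input vectors v_c

def sgns_loss(c, o, negatives):
    """Negative-sampling loss for one (context, target) pair and k sampled negatives."""
    positive = -np.log(sigmoid(U[o] @ Vin[c]))                           # real pair -> label 1
    negative = -sum(np.log(sigmoid(-U[w] @ Vin[c])) for w in negatives)  # fakes -> label 0
    return positive + negative

print(sgns_loss(c=0, o=2, negatives=[1, 4]))  # k = 2 negatives here
```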
Intuition
- Positive term: pushes $\sigma(u_o^\top v_c)$ higher, i.e. makes the dot product large (vectors more similar)
- Negative terms: push $\sigma(u_{w_i'}^\top v_c)$ lower, i.e. make the dot products small (vectors more dissimilar)
So you’re teaching the model:
“Real pairs should align; fake pairs should diverge.”
Final Training Loss (Skip-Gram with Negative Sampling)
Instead of summing log-probabilities from the full softmax, we use this modified loss for every training pair $(w_t, w_{t+j})$:

$$J_{t,j} = -\log \sigma(u_{w_{t+j}}^\top v_{w_t}) \;-\; \sum_{i=1}^{k} \mathbb{E}_{w_i' \sim P_n(w)} \left[ \log \sigma(-u_{w_i'}^\top v_{w_t}) \right]$$
We minimize this using stochastic gradient descent, just like before, but now:
- We only update the vectors for the target, the context word, and a few negatives (see the sketch below).
- This makes training much, much faster.
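For contrast with the full-softmax step sketched earlier, here is a hedged sketch of one SGNS update; the gradients follow from the loss above, and only the target, context, and k negative vectors are touched. The toy setup is again assumed for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy setup (assumed for illustration).
V, d, lr = 5, 3, 0.05
rng = np.random.default_rng(0)
U = rng.normal(size=(V, d))    # output vectors u_i
Vin = rng.normal(size=(V, d))  # input vectors v_c

def sgns_sgd_step(c, o, negatives):
    """One SGD step on the SGNS loss: only 1 context + 1 target + k negative vectors move."""
    grad_vc = np.zeros(d)
    # Positive pair: push sigma(u_o . v_c) toward 1.
    g = sigmoid(U[o] @ Vin[c]) - 1.0
    grad_vc += g * U[o]
    U[o] = U[o] - lr * g * Vin[c]
    # Negative samples: push sigma(u_w . v_c) toward 0.
    for w in negatives:
        g = sigmoid(U[w] @ Vin[c])
        grad_vc += g * U[w]
        U[w] = U[w] - lr * g * Vin[c]
    Vin[c] = Vin[c] - lr * grad_vc  # context vector updated once at the end

sgns_sgd_step(c=0, o=2, negatives=[1, 4])
```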
Why It Works So Well
- Scales easily: instead of updating all vocabulary vectors, we only update a few (target + k negatives)
- Good representations: despite the approximation, it produces high-quality embeddings
- Empirically effective: works especially well on large corpora with millions or billions of words