Word2Vec Assumption
The probability of observing a word in the context of another word depends only on the relationship between their word vectors.
This means: if you’re trying to predict a word given its context (or vice versa), you do so based only on how similar their vector representations are.
Mathematical Expression

$$P(w_o \mid w_c) = \frac{\exp(u_o^\top v_c)}{\sum_{i=1}^{V} \exp(u_i^\top v_c)}$$

Definitions:
- $w_c$: the context word (known/input).
- $w_o$: the target word (the one we want to predict).
- $u_i$: the output (target) word vector for word $i$ (so $u_o$ is the output vector of $w_o$).
- $v_c$: the input (context) word vector for word $w_c$.
- $u_o^\top v_c$: the dot product of the target and context word vectors.
- $\sum_{i=1}^{V} \exp(u_i^\top v_c)$: normalization term over all words in the vocabulary to make it a valid probability.
What does the formula represent?
This is a softmax function:
- The numerator $\exp(u_o^\top v_c)$ gives a score (via the exponential of the dot product) for how compatible $w_o$ and $w_c$ are.
- The denominator normalizes this score over the entire vocabulary.

So this gives us the probability of the target word $w_o$ appearing in the context of $w_c$.
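To make this concrete, here is a minimal NumPy sketch of the softmax probability; the toy vocabulary size, embedding dimension, and the matrices `U` (output vectors) and `Vin` (input vectors) are assumptions made up for illustration, not part of the original formulation.

```python
import numpy as np

# Toy setup (assumed for illustration): vocabulary of 5 words, 3-dim embeddings.
V, d = 5, 3
rng = np.random.default_rng(0)
U = rng.normal(size=(V, d))    # output (target) vectors u_i, one row per word
Vin = rng.normal(size=(V, d))  # input (context) vectors v_c, one row per word

def p_target_given_context(o, c):
    """P(w_o | w_c): softmax over the dot products of v_c with every u_i."""
    scores = U @ Vin[c]         # u_i . v_c for every word i, shape (V,)
    scores -= scores.max()      # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()

print(p_target_given_context(o=2, c=0))  # probability of word 2 given context word 0
```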
How to train?
Likelihood:

$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta)$$

Then minimize the negative log-likelihood:

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)$$

Update rule:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta)$$

where:
- $\theta$ = model parameters (word vectors)
- $\eta$ = learning rate
Limitation 1: we need to compute the dot product with all $V$ words in the vocabulary every time we evaluate the softmax.
Limitation 2: gradient descent requires iterating over the whole corpus for every update, which is very slow.
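To see where the cost comes from, here is a hedged sketch of one SGD step on the full-softmax loss for a single (context, target) pair; the toy setup (`U`, `Vin`, learning rate) is assumed for illustration. Note that the gradient touches every row of `U`.

```python
import numpy as np

# Toy setup (assumed for illustration).
V, d, lr = 5, 3, 0.05
rng = np.random.default_rng(0)
U = rng.normal(size=(V, d))    # output vectors u_i
Vin = rng.normal(size=(V, d))  # input vectors v_c

def sgd_step_full_softmax(o, c):
    """One SGD step on J = -log P(w_o | w_c). Every row of U is touched."""
    scores = U @ Vin[c]
    p = np.exp(scores - scores.max())
    p /= p.sum()                  # softmax over the whole vocabulary: O(V * d)
    p[o] -= 1.0                   # p - one_hot(o), the usual softmax gradient
    grad_U = np.outer(p, Vin[c])  # gradient w.r.t. ALL V output vectors
    grad_vc = U.T @ p             # gradient w.r.t. the single context vector
    U[:] -= lr * grad_U           # O(V * d) update per training pair (Limitation 1)
    Vin[c] -= lr * grad_vc

sgd_step_full_softmax(o=2, c=0)
```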
Negative Sampling
Recall the probability:

$$P(w_o \mid w_c) = \frac{\exp(u_o^\top v_c)}{\sum_{i=1}^{V} \exp(u_i^\top v_c)}$$

In order to maximize this probability, we want to:
- Maximize the numerator: $\exp(u_o^\top v_c)$
- Minimize the denominator: $\sum_{i=1}^{V} \exp(u_i^\top v_c)$

but computing the denominator over the whole vocabulary is costly. So instead of minimizing the full softmax denominator, we approximate it using a small number of sampled negative terms:
- Keep the positive pair $(w_c, w_o)$
- Sample a few negative examples $w_1', \ldots, w_k'$ from a noise distribution
- Use a binary classification loss to push the model to distinguish real vs. fake context words (see the sketch after this list)
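As a sketch of the sampling step, assuming a made-up table of word counts, negatives can be drawn from the unigram distribution raised to the 3/4 power (the common choice for the noise distribution mentioned below):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus word counts (assumed for illustration): index = word id.
counts = np.array([50.0, 30.0, 10.0, 5.0, 5.0])

# Noise distribution P_n(w): unigram counts raised to the 3/4 power, renormalized.
noise = counts ** 0.75
noise /= noise.sum()

def sample_negatives(true_target, k=3):
    """Draw k negative word ids from P_n(w), skipping the true target word."""
    negatives = []
    while len(negatives) < k:
        w = int(rng.choice(len(counts), p=noise))
        if w != true_target:
            negatives.append(w)
    return negatives

print(sample_negatives(true_target=2, k=3))
```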
From Softmax to Negative Sampling
Recall the original softmax:

$$P(w_o \mid w_c) = \frac{\exp(u_o^\top v_c)}{\sum_{i=1}^{V} \exp(u_i^\top v_c)}$$

This is expensive because the denominator sums over all $V$ words in the vocabulary.
Negative Sampling Objective
Instead of computing this full probability, we reframe the task as a binary classification problem:
Given a pair $(w_c, w)$, is it a real context pair (label 1) or a negative (label 0)?
We maximize the likelihood of labeling:
- Positive pairs (true context pairs) as 1
- Negative pairs (randomly sampled words) as 0
So the new loss for one pair $(w_c, w_o)$ becomes:

$$J = -\log \sigma(u_o^\top v_c) \;-\; \sum_{i=1}^{k} \mathbb{E}_{w_i' \sim P_n(w)} \left[ \log \sigma(-u_{w_i'}^\top v_c) \right]$$
Where:
- $k$ = number of negative samples
- $P_n(w)$ = noise distribution (often the unigram distribution raised to the 3/4 power)
- $w_i'$ = a negative sample (not a real context word for $w_c$)
- $\sigma$ = the logistic sigmoid, $\sigma(x) = 1 / (1 + e^{-x})$
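A minimal NumPy sketch of this per-pair loss; the toy vectors and the `sgns_loss` helper are assumptions made up for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy setup (assumed for illustration): 5-word vocab, 3-dim embeddings.
V, d = 5, 3
rng = np.random.default_rng(0)
U = rng.normal(size=(V, d))    # output vectors u_i
Vin = rng.normal(size=(V, d))  # input vectors v_c

def sgns_loss(c, o, negatives):
    """Negative-sampling loss for one (context, target) pair and k sampled negatives."""
    positive = -np.log(sigmoid(U[o] @ Vin[c]))                           # real pair -> label 1
    negative = -sum(np.log(sigmoid(-U[w] @ Vin[c])) for w in negatives)  # fakes -> label 0
    return positive + negative

print(sgns_loss(c=0, o=2, negatives=[1, 4]))  # k = 2 negatives here
```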
Intuition
- Positive term: pushes $\sigma(u_o^\top v_c)$ higher, i.e. makes the dot product large (vectors more similar)
- Negative terms: push $\sigma(u_{w_i'}^\top v_c)$ lower, i.e. make the dot products small (vectors more dissimilar)
So you’re teaching the model:
“Real pairs should align; fake pairs should diverge.”
Final Training Loss (Skip-Gram with Negative Sampling)
Instead of summing log-probabilities from the full softmax, we use this modified loss for every training pair $(w_t, w_{t+j})$:

$$J_{t,j} = -\log \sigma(u_{w_{t+j}}^\top v_{w_t}) \;-\; \sum_{i=1}^{k} \mathbb{E}_{w_i' \sim P_n(w)} \left[ \log \sigma(-u_{w_i'}^\top v_{w_t}) \right]$$
We minimize this using stochastic gradient descent, just like before, but now:
- We only update the vectors for the target, the context word, and a few negatives (see the sketch below).
- This makes training much, much faster.
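For contrast with the full-softmax step sketched earlier, here is a hedged sketch of one SGNS update; the gradients follow from the loss above, and only the target, context, and k negative vectors are touched. The toy setup is again assumed for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy setup (assumed for illustration).
V, d, lr = 5, 3, 0.05
rng = np.random.default_rng(0)
U = rng.normal(size=(V, d))    # output vectors u_i
Vin = rng.normal(size=(V, d))  # input vectors v_c

def sgns_sgd_step(c, o, negatives):
    """One SGD step on the SGNS loss: only 1 context + 1 target + k negative vectors move."""
    grad_vc = np.zeros(d)
    # Positive pair: push sigma(u_o . v_c) toward 1.
    g = sigmoid(U[o] @ Vin[c]) - 1.0
    grad_vc += g * U[o]
    U[o] = U[o] - lr * g * Vin[c]
    # Negative samples: push sigma(u_w . v_c) toward 0.
    for w in negatives:
        g = sigmoid(U[w] @ Vin[c])
        grad_vc += g * U[w]
        U[w] = U[w] - lr * g * Vin[c]
    Vin[c] = Vin[c] - lr * grad_vc  # context vector updated once at the end

sgns_sgd_step(c=0, o=2, negatives=[1, 4])
```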
Why It Works So Well
- Scales easily: instead of updating all vocabulary vectors, we only update a few (target + k negatives)
- Good representations: despite the approximation, it produces high-quality embeddings
- Empirically effective: works especially well on large corpora with millions or billions of words