How do we measure the relationships between words?
Languages are ambiguous:
- Lexical ambiguity: A word has multiple meanings.
  Example: “bank” → (financial institution vs. river edge)
- Syntactic ambiguity: Multiple possible sentence structures.
  Example: “I saw the man with the telescope.”
- Semantic ambiguity: Meaning is unclear due to missing context.
  Example: “They are flying planes.” (Who is flying?)
How do humans deal with these ambiguities?
By relying on context and experience.
Insight: If humans resolve ambiguity through context, then deep learning might be able to do the same using large datasets.
WordNet
- A human-curated lexical database.
- Defines explicit relationships between words:
- Synonymy (same meaning)
- Antonymy (opposite meaning)
- Hypernymy (is-a relationship)
- Meronymy (part-of relationship)
- Structured as a large semantic graph.
Example:
- “dog” → hypernym: “canine” → hypernym: “animal”
- “car” → meronym: “wheel”, “engine”
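These relations can also be queried programmatically. Below is a small sketch using NLTK’s WordNet interface (assuming the nltk package and its WordNet corpus are installed; the exact synsets printed may differ by WordNet version):

```python
# Querying WordNet relations via NLTK (requires `pip install nltk` and a
# one-time `nltk.download("wordnet")`).
from nltk.corpus import wordnet as wn

dog = wn.synset("dog.n.01")        # first noun sense of "dog"
print(dog.hypernyms())             # is-a parents, e.g. canine / domestic_animal synsets

car = wn.synset("car.n.01")
print(car.part_meronyms())         # part-of relations, e.g. wheel / engine-related synsets
```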
Limitations:
- Requires a lot of human effort (expensive)
- Based on subjective judgments
- Not scalable across domains or languages
- Doesn’t handle semantic fuzziness (e.g., similarity between “car” and “truck”)
Word Representations: From Symbols to Vectors
First Idea: One-Hot Encoding
- Represent each word as a one-hot vector : a binary vector of length V (vocabulary size) with a single 1 and the rest 0s.
Example (for vocab = [apple, banana, cat]):
- “apple” → [1, 0, 0]
- “banana” → [0, 1, 0]
- “cat” → [0, 0, 1]
- These are standard basis vectors in ℝ^V.
Limitations:
- Sparse and high-dimensional
- No notion of similarity — all vectors are orthogonal
- “cat” and “dog” are as unrelated as “cat” and “banana”
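A minimal numpy sketch (toy vocabulary and helper name are just for illustration) makes this orthogonality explicit:

```python
import numpy as np

vocab = ["apple", "banana", "cat"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Standard basis vector for `word` in R^V."""
    vec = np.zeros(len(vocab))
    vec[word_to_idx[word]] = 1.0
    return vec

apple, banana, cat = (one_hot(w) for w in vocab)
print(apple)               # [1. 0. 0.]
print(apple @ banana)      # 0.0 -- every distinct pair is orthogonal
print(cat @ banana)        # 0.0 -- no notion of "more" or "less" similar
```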
Improved Idea: Word Embeddings
- Goal: Learn a continuous vector for each word that captures semantic relationships.
- No longer treat words as isolated symbols. Instead, learn representations from *how words occur in context*.
How do we capture similarity between words?
We don’t use WordNet’s manual graph approach. Instead, we follow ideas from philosophy and linguistics:
- Ludwig Wittgenstein:
“The meaning of a word is its use in the language.”
- John Firth:
“You shall know a word by the company it keeps.”
This leads to the Distributional Hypothesis:
Words that appear in similar contexts tend to have similar meanings.
What is context?
Context can be defined in different ways:
- Document
- Paragraph
- Sentence
- Fixed-size window around a word
K-Window Context
A simple and popular approach:
- For a word at position i, define its context as the words from (i - k) to (i + k), excluding the word itself.
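A short sketch of this definition (the function name and example sentence are just for illustration):

```python
def k_window_context(tokens, i, k=2):
    """Words from position i-k to i+k, excluding the target word at i."""
    left = tokens[max(0, i - k):i]
    right = tokens[i + 1:i + k + 1]
    return left + right

tokens = "i saw the man with the telescope".split()
print(k_window_context(tokens, 3, k=2))   # ['saw', 'the', 'with', 'the']
```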
Term-Context Matrix
- Build a vocabulary-size × vocabulary-size matrix
- Rows = target words
- Columns = context words
- Values = counts of how often each word appears in the context of another
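A sketch of building this matrix from a toy corpus (the corpus, window size k, and variable names are arbitrary choices here):

```python
import numpy as np

corpus = [
    "i like deep learning".split(),
    "i like nlp".split(),
    "i enjoy flying".split(),
]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
k = 2  # context window size

# counts[t, c] = how often context word c appears within k words of target t
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, target in enumerate(sent):
        for ctx in sent[max(0, i - k):i] + sent[i + 1:i + k + 1]:
            counts[idx[target], idx[ctx]] += 1

print(vocab)
print(counts[idx["like"]])   # row of co-occurrence counts for the target "like"
```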
How to compute similarity?
One simple approach: dot product between word vectors
- Limitation 1: sensitive to vector length (scale)
- Limitation 2: Biased by high-frequency words (e.g., “UNK”, “the”)
- These words appear in many contexts and accumulate large vector norms
- This can make them falsely appear highly similar to many words
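A tiny numeric illustration of this bias (the count vectors below are made up):

```python
import numpy as np

# Toy co-occurrence count vectors over four context dimensions.
cat = np.array([10.,  8.,  1.,  0.])
dog = np.array([ 9.,  7.,  2.,  0.])
the = np.array([90., 80., 85., 95.])   # frequent word: co-occurs with everything

print(cat @ dog)   # 148.0
print(cat @ the)   # 1625.0 -- the large norm of "the" inflates the score
```

By raw dot product, “the” looks far more similar to “cat” than “dog” does, purely because of its larger norm.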
A better approach: cosine similarity
- Formula: cos(u, v) = (u · v) / (‖u‖ ‖v‖)
- Measures angle between vectors, not magnitude
- Values range from -1 to 1
- Captures semantic similarity more robustly
- Limitation 1: With a large vocabulary V, the word vectors are still mostly sparse and high-dimensional.
  → Reduce dimensionality using SVD (keep only the top singular components).
- Limitation 2: Frequent but uninformative words (e.g., “the”) can make unrelated words appear to have similar contexts.
  → Reweight counts using PPMI (Positive Pointwise Mutual Information).
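A compact sketch of these three steps (cosine similarity, PPMI reweighting, truncated SVD), assuming the `counts` matrix and `idx` mapping from the term-context sketch above are in scope; the helper names and d = 2 are arbitrary:

```python
import numpy as np

def cosine(u, v):
    """Angle-based similarity, invariant to vector length."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def ppmi(counts):
    """Positive pointwise mutual information reweighting of a count matrix."""
    total = counts.sum()
    p_wc = counts / total                       # joint probabilities
    p_w = p_wc.sum(axis=1, keepdims=True)       # target-word marginals
    p_c = p_wc.sum(axis=0, keepdims=True)       # context-word marginals
    with np.errstate(divide="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    return np.maximum(pmi, 0.0)                 # clip negative / -inf values to 0

def truncated_svd(matrix, d):
    """Dense d-dimensional rows from the top-d singular components."""
    U, S, _ = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :d] * S[:d]

dense = truncated_svd(ppmi(counts), d=2)        # short, dense word vectors
print(cosine(dense[idx["like"]], dense[idx["enjoy"]]))
```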
Learnable Word Vectors
- Count-based word representations work, but they are static.
- Meaning is limited to observed co-occurrences, with no generalization to unseen contexts.
- Sparsity of count-based vectors makes them memory-inefficient and hard to scale.
So instead of memorizing co-occurrence counts, we learn word vectors that generalize meaning and compress information efficiently.
→ Continued on the next page: 2. Word2Vec