How do we measure the relationships between words?
Languages are ambiguous:
- Lexical ambiguity: A word has multiple meanings.
  Example: “bank” → (financial institution vs. river edge)
- Syntactic ambiguity: Multiple possible sentence structures.
  Example: “I saw the man with the telescope.”
- Semantic ambiguity: Meaning is unclear due to missing context.
  Example: “They are flying planes.” (Who is flying?)
How do humans deal with these ambiguities?
By relying on context and experience.
Insight: If humans resolve ambiguity through context, then deep learning might be able to do the same using large datasets.
WordNet
- A human-curated lexical database.
- Defines explicit relationships between words:
- Synonymy (same meaning)
- Antonymy (opposite meaning)
- Hypernymy (is-a relationship)
- Meronymy (part-of relationship)
- Structured as a large semantic graph.
Example:
- “dog” → hypernym: “canine” → hypernym: “animal”
- “car” → meronym: “wheel”, “engine”
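These relations can also be queried programmatically. Below is a small sketch using NLTK’s WordNet interface (assuming the nltk package and its WordNet corpus are installed; the exact synsets printed may differ by WordNet version):

```python
# Querying WordNet relations via NLTK (requires `pip install nltk` and a
# one-time `nltk.download("wordnet")`).
from nltk.corpus import wordnet as wn

dog = wn.synset("dog.n.01")        # first noun sense of "dog"
print(dog.hypernyms())             # is-a parents, e.g. canine / domestic_animal synsets

car = wn.synset("car.n.01")
print(car.part_meronyms())         # part-of relations, e.g. wheel / engine-related synsets
```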
Limitations:
- Requires a lot of human effort (expensive)
- Based on subjective judgments
- Not scalable across domains or languages
- Doesn’t handle semantic fuzziness (e.g., similarity between “car” and “truck”)
Word Representations: From Symbols to Vectors
First Idea: One-Hot Encoding
- Represent each word as a one-hot vector : a binary vector of length V (vocabulary size) with a single 1 and the rest 0s.
Example (for vocab = [apple, banana, cat]):
- “apple” → [1, 0, 0]
- “banana” → [0, 1, 0]
- “cat” → [0, 0, 1]
- These are standard basis vectors in ℝ^V.
Limitations:
- Sparse and high-dimensional
- No notion of similarity — all vectors are orthogonal
- “cat” and “dog” are as unrelated as “cat” and “banana”
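A minimal numpy sketch (toy vocabulary and helper name are just for illustration) makes this orthogonality explicit:

```python
import numpy as np

vocab = ["apple", "banana", "cat"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Standard basis vector for `word` in R^V."""
    vec = np.zeros(len(vocab))
    vec[word_to_idx[word]] = 1.0
    return vec

apple, banana, cat = (one_hot(w) for w in vocab)
print(apple)               # [1. 0. 0.]
print(apple @ banana)      # 0.0 -- every distinct pair is orthogonal
print(cat @ banana)        # 0.0 -- no notion of "more" or "less" similar
```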
Improved Idea: Word Embeddings
- Goal: Learn a continuous vector for each word that captures semantic relationships.
- No longer treat words as isolated symbols. Instead, learn representations from *how words occur in context*.
How do we capture similarity between words?
We don’t use WordNet’s manual graph approach. Instead, we follow ideas from philosophy and linguistics:
- Ludwig Wittgenstein:
“The meaning of a word is its use in the language.”
- John Firth:
“You shall know a word by the company it keeps.”
This leads to the Distributional Hypothesis:
Words that appear in similar contexts tend to have similar meanings.
What is context?
Context can be defined in different ways:
- Document
- Paragraph
- Sentence
- Fixed-size window around a word
K-Window Context
A simple and popular approach:
- For a word at position i, define its context as the words from (i - k) to (i + k), excluding the word itself.
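A short sketch of this definition (the function name and example sentence are just for illustration):

```python
def k_window_context(tokens, i, k=2):
    """Words from position i-k to i+k, excluding the target word at i."""
    left = tokens[max(0, i - k):i]
    right = tokens[i + 1:i + k + 1]
    return left + right

tokens = "i saw the man with the telescope".split()
print(k_window_context(tokens, 3, k=2))   # ['saw', 'the', 'with', 'the']
```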
Term-Context Matrix
- Build a vocabulary-size × vocabulary-size matrix
- Rows = target words
- Columns = context words
- Values = counts of how often each word appears in the context of another
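A sketch of building this matrix from a toy corpus (the corpus, window size k, and variable names are arbitrary choices here):

```python
import numpy as np

corpus = [
    "i like deep learning".split(),
    "i like nlp".split(),
    "i enjoy flying".split(),
]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
k = 2  # context window size

# counts[t, c] = how often context word c appears within k words of target t
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, target in enumerate(sent):
        for ctx in sent[max(0, i - k):i] + sent[i + 1:i + k + 1]:
            counts[idx[target], idx[ctx]] += 1

print(vocab)
print(counts[idx["like"]])   # row of co-occurrence counts for the target "like"
```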
How to compute similarity?
One simple approach: dot product between word vectors
- Limitation 1: sensitive to vector length (scale)
- Limitation 2: Biased by high-frequency words (e.g., “UNK”, “the”)
- These words appear in many contexts and accumulate large vector norms
- This can make them falsely appear highly similar to many words
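A tiny numeric illustration of this bias (the count vectors below are made up):

```python
import numpy as np

# Toy co-occurrence count vectors over four context dimensions.
cat = np.array([10.,  8.,  1.,  0.])
dog = np.array([ 9.,  7.,  2.,  0.])
the = np.array([90., 80., 85., 95.])   # frequent word: co-occurs with everything

print(cat @ dog)   # 148.0
print(cat @ the)   # 1625.0 -- the large norm of "the" inflates the score
```

By raw dot product, “the” looks far more similar to “cat” than “dog” does, purely because of its larger norm.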
A better approach: cosine similarity
- Formula: cos(u, v) = (u · v) / (‖u‖ ‖v‖)
- Measures angle between vectors, not magnitude
- Values range from -1 to 1
- Captures semantic similarity more robustly
- Limitation 1: With a large vocabulary V, the word vectors are still mostly sparse and high-dimensional.
  → Reduce dimensionality using SVD (keep only the top singular components).
- Limitation 2: Frequent but uninformative words (e.g., “the”) can make unrelated words appear to have similar contexts.
  → Reweight counts using PPMI (Positive Pointwise Mutual Information).
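A compact sketch of these three steps (cosine similarity, PPMI reweighting, truncated SVD), assuming the `counts` matrix and `idx` mapping from the term-context sketch above are in scope; the helper names and d = 2 are arbitrary:

```python
import numpy as np

def cosine(u, v):
    """Angle-based similarity, invariant to vector length."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def ppmi(counts):
    """Positive pointwise mutual information reweighting of a count matrix."""
    total = counts.sum()
    p_wc = counts / total                       # joint probabilities
    p_w = p_wc.sum(axis=1, keepdims=True)       # target-word marginals
    p_c = p_wc.sum(axis=0, keepdims=True)       # context-word marginals
    with np.errstate(divide="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    return np.maximum(pmi, 0.0)                 # clip negative / -inf values to 0

def truncated_svd(matrix, d):
    """Dense d-dimensional rows from the top-d singular components."""
    U, S, _ = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :d] * S[:d]

dense = truncated_svd(ppmi(counts), d=2)        # short, dense word vectors
print(cosine(dense[idx["like"]], dense[idx["enjoy"]]))
```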
Learnable Word Vectors
- Count-based word representations work, but they are static.
- Meaning is limited to observed co-occurrences, with no generalization to unseen contexts.
- Sparsity of count-based vectors makes them memory-inefficient and hard to scale.
So instead of memorizing co-occurrence counts, we learn word vectors that generalize meaning and compress information efficiently.
→ Continued on the next page: 2. Word2Vec