How do we measure the relationships between words?

Languages are ambiguous:

  • Lexical ambiguity: A word has multiple meanings.

    Example: “bank” → (financial institution vs. river edge)

  • Syntactic ambiguity: Multiple possible sentence structures.

    Example: “I saw the man with the telescope.”

  • Semantic ambiguity: Meaning is unclear due to missing context.

    Example: “They are flying planes.” (Is “they” the people flying the planes, or the planes that are flying?)

How do humans deal with these ambiguities?

By relying on context and experience.

Insight: If humans resolve ambiguity through context, then deep learning might be able to do the same using large datasets.

WordNet

  • A human-curated lexical database.
  • Defines explicit relationships between words:
    • Synonymy (same meaning)
    • Antonymy (opposite meaning)
    • Hypernymy (is-a relationship)
    • Meronymy (part-of relationship)
  • Structured as a large semantic graph.

Example:

  • “dog” → hypernym: “canine” → hypernym: “animal”
  • “car” → meronym: “wheel”, “engine”
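
These relations can be queried programmatically; a quick sketch using NLTK's WordNet interface (assuming nltk and its wordnet corpus are installed):

```python
# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

dog = wn.synsets("dog")[0]      # most common sense of "dog"
print(dog.hypernyms())          # is-a parents, e.g. a "canine" synset

car = wn.synsets("car")[0]
print(car.part_meronyms())      # part-of relations: parts of a car
```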

Limitations:

  • Requires a lot of human effort (expensive)
  • Based on subjective judgments
  • Not scalable across domains or languages
  • Doesn’t handle semantic fuzziness (e.g., similarity between “car” and “truck”)

Word Representations: From Symbols to Vectors

First Idea: One-Hot Encoding

  • Represent each word as a one-hot vector: a binary vector of length V (the vocabulary size) with a single 1 and the rest 0s.

Example (for vocab = [apple, banana, cat]):

  • “apple” → [1, 0, 0]

  • “banana” → [0, 1, 0]

  • “cat” → [0, 0, 1]

  • These are standard basis vectors in ℝ^V.

Limitations:

  • Sparse and high-dimensional
  • No notion of similarity — all vectors are orthogonal
    • “cat” and “dog” are as unrelated as “cat” and “banana”
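
A minimal NumPy sketch of this encoding with the toy vocabulary above, also showing the orthogonality problem:

```python
import numpy as np

vocab = ["apple", "banana", "cat"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # Standard basis vector in R^V: a single 1 at the word's index.
    vec = np.zeros(len(vocab))
    vec[word_to_idx[word]] = 1.0
    return vec

print(one_hot("apple"))                    # [1. 0. 0.]
print(one_hot("cat") @ one_hot("banana"))  # 0.0: distinct words are always orthogonal
```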

Improved Idea: Word Embeddings

  • Goal: Learn a continuous vector for each word that captures semantic relationships.

  • No longer treat words as isolated symbols. Instead, learn representations from how words occur in context.

How do we capture similarity between words?

We don’t use WordNet’s manual graph approach. Instead, we follow ideas from philosophy and linguistics:

  • Ludwig Wittgenstein:

“The meaning of a word is its use in the language.”

  • John Firth:

“You shall know a word by the company it keeps.”

This leads to the Distributional Hypothesis:

Words that appear in similar contexts tend to have similar meanings.

What is context?

Context can be defined in different ways:

  • Document
  • Paragraph
  • Sentence
  • Fixed-size window around a word

K-Window Context

A simple and popular approach:

  • For a word at position i, define its context as the words from (i - k) to (i + k), excluding the word itself.
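
A small sketch of this windowing (the function name and toy sentence are illustrative):

```python
def k_window_context(tokens, i, k=2):
    # Words from position i-k to i+k, excluding the word at i itself.
    return tokens[max(0, i - k):i] + tokens[i + 1:i + k + 1]

tokens = "the quick brown fox jumps".split()
print(k_window_context(tokens, 2))  # ['the', 'quick', 'fox', 'jumps']
```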

Term-Context Matrix

  • Build a |V| × |V| matrix (V = vocabulary size)
  • Rows = target words
  • Columns = context words
  • Values = counts of how often each word appears in the context of another
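
A sketch of building such a matrix from a toy corpus, reusing the k-window idea above (the corpus and k = 1 are illustrative):

```python
import numpy as np

corpus = ["i like deep learning", "i like nlp", "i enjoy flying"]
vocab = sorted({t for sent in corpus for t in sent.split()})
idx = {w: i for i, w in enumerate(vocab)}

k = 1
M = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        # Count every word within k positions of w (excluding w itself).
        for c in words[max(0, i - k):i] + words[i + 1:i + k + 1]:
            M[idx[w], idx[c]] += 1

print(vocab)
print(M[idx["like"]])   # how often each word appears in the context of "like"
```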

How to compute similarity?

One simple approach: dot product between word vectors
  • Limitation 1: Sensitive to vector length (scale)
  • Limitation 2: Biased toward high-frequency words (e.g., “the”, or the unknown token “UNK”)
    • These words appear in many contexts and accumulate large vector norms
    • This can make them falsely appear highly similar to many words
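
A toy demonstration of this bias (the vectors are made-up counts):

```python
# The dot product grows with vector norm, so a frequent word with
# large co-occurrence counts looks "similar" to everything.
import numpy as np

cat = np.array([2.0, 1.0, 0.0])
dog = np.array([2.0, 1.2, 0.1])
the = np.array([20.0, 10.0, 5.0])   # frequent word: large counts everywhere

print(cat @ dog)   # 5.2
print(cat @ the)   # 50.0, larger even though "the" is not semantically closer
```
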
A better approach: cosine similarity

  cos(u, v) = (u · v) / (‖u‖ ‖v‖)

  • Measures angle between vectors, not magnitude
  • Values range from -1 to 1
  • Captures semantic similarity more robustly
  • Limitation 1: With a large vocabulary size V, the word vectors are still mostly sparse

    Fix: reduce the dimensionality using truncated SVD (see the sketch below)

  • Limitation 2: Frequent but uninformative words (e.g., “the”) can make unrelated words appear to share similar contexts

    Fix: reweight the counts with PPMI (Positive Pointwise Mutual Information; see the sketch below)
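
A minimal sketch combining these fixes, reusing the count matrix M and index idx from the term-context sketch above (function names are illustrative):

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (||u|| ||v||); invariant to vector length.
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

def ppmi(M):
    # PPMI(w, c) = max(0, log2(P(w, c) / (P(w) * P(c))))
    # Downweights frequent-but-uninformative context words.
    total = M.sum()
    p_wc = M / total
    p_w = M.sum(axis=1, keepdims=True) / total
    p_c = M.sum(axis=0, keepdims=True) / total
    ratio = np.where(p_wc > 0, p_wc / (p_w * p_c), 1.0)  # log2(1) = 0 for zero counts
    return np.maximum(np.log2(ratio), 0.0)

def svd_embed(M, d):
    # Keep the top-d singular directions as dense d-dimensional word vectors.
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :d] * S[:d]

vecs = svd_embed(ppmi(M), d=2)
print(cosine_similarity(vecs[idx["like"]], vecs[idx["enjoy"]]))
```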

Learnable WordVec

  • Counting-based word representations work, but they are static.
  • Meaning is limited to observed co-occurrences only; there is no generalization to unseen contexts.
  • Sparsity of count-based vectors makes them memory-inefficient and hard to scale.

So instead of memorizing co-occurrence counts, we learn word vectors that generalize meaning and compress information efficiently.

→ Continued on the next page: 2. Word2Vec