Tracking Word Meaning Evolution in Vector Space

The Motivation

Words change meaning over time. This is not a remarkable observation. Linguists have been documenting semantic shift for centuries, and computational approaches, including word2vec models trained on separate time slices of a corpus, have tracked how word meanings evolve from one period to the next. What is less explored is whether modern neural embedding models can reveal finer-grained patterns of semantic change, particularly at the single-token level.

The motivating question is deceptively simple: if we can represent a word's meaning as a point in a high-dimensional vector space, can we plot how that point moves from decade to decade, and does the trajectory tell us something meaningful about how language changes?

The Problem: We Cannot Extract Per-Word Embeddings

Modern embedding models output representations for entire sequences, not individual words. The standard architecture processes a text through a transformer encoder, aggregates the per-token representations through mean pooling or a [CLS] token, and returns a single dense vector for the entire input. This vector represents the meaning of the text, not the meaning of any specific word within it.
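To make the aggregation step concrete, here is a minimal sketch of masked mean pooling over per-token vectors. It uses NumPy and invented toy numbers in place of a real encoder; the function name `mean_pool` and the array shapes are assumptions for illustration, not the API of any particular embedding model.

```python
import numpy as np

def mean_pool(token_vectors: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average the vectors of real (non-padding) tokens into one sequence vector.

    token_vectors:  (seq_len, dim) per-token outputs of an encoder.
    attention_mask: (seq_len,) with 1 for real tokens and 0 for padding.
    """
    mask = attention_mask.astype(token_vectors.dtype)[:, None]  # (seq_len, 1)
    summed = (token_vectors * mask).sum(axis=0)
    count = max(float(mask.sum()), 1.0)  # guard against an all-padding input
    return summed / count

# Toy input (invented values): 4 token positions, hidden size 3.
# The last position is padding and must not affect the result.
token_vectors = np.array([
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
    [9.0, 9.0, 9.0],
])
attention_mask = np.array([1, 1, 1, 0])

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)
```

After pooling, the three token vectors are no longer recoverable from `sentence_vector`, which is exactly why the pooled output cannot be attributed to any one word.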

For most use cases, this is the correct design. Sentence-level or document-level embeddings are what retrieval systems need. But for the task of tracking word-level meaning evolution, it is insufficient. We cannot take two sentences containing the word "compute" from different decades, embed each sentence, and attribute the difference in vectors to the word "compute" itself. The vector difference could be caused by any word in either sentence.

The problem is compounded by tokenization. A word in human language may map to multiple subword tokens. The word "computing" might tokenize to [comput, ##ing] or to a single token, depending on the vocabulary. Even when a word maps to a single token, that token's representation in the transformer is shaped by the surrounding context through self-attention, making it inseparable from the sentence it appears in.

A Proposed Solution: Remove Mean Pooling

The simplest approach to extracting per-word embeddings is to remove the final mean pooling or [CLS] aggregation layer of the embedding model and instead take the raw per-token representations from the last transformer layer.

In this setup, given a sentence, the model outputs a sequence of vectors, one per input token. Each vector represents the contextualized meaning of that specific token, conditioned on the full sentence. If the word maps to a single token, that token's vector is the per-word embedding. If it maps to multiple subword tokens, we can aggregate them to recover a per-word representation. Averaging keeps the dimensionality fixed; concatenation does not, because the result grows with the number of subword pieces, so it only works if every word is padded or truncated to the same number of pieces.

This is not a radical idea. It is essentially the output of BERT or any transformer encoder before pooling. What makes it interesting in this context is the question of whether these representations are stable enough across different sentences to support longitudinal comparison.
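The subword aggregation step can be sketched as follows. This is a hedged illustration with NumPy and invented vectors: it assumes the per-token hidden states have already been taken from the encoder's last layer, and that a token-to-word mapping is available (for example, the kind of list returned by `word_ids()` on a Hugging Face fast tokenizer's encoding). The function name `word_vectors` is hypothetical.

```python
import numpy as np
from typing import Dict, List, Optional

def word_vectors(hidden: np.ndarray, word_ids: List[Optional[int]]) -> List[np.ndarray]:
    """Average the subword vectors that belong to the same word.

    hidden:   (seq_len, dim) per-token vectors from the last encoder layer.
    word_ids: for each token position, the index of the word it came from,
              or None for special tokens such as [CLS] and [SEP].
    Returns one vector per word, ordered by word index.
    """
    groups: Dict[int, List[np.ndarray]] = {}
    for pos, wid in enumerate(word_ids):
        if wid is None:
            continue  # skip special tokens
        groups.setdefault(wid, []).append(hidden[pos])
    return [np.mean(groups[w], axis=0) for w in sorted(groups)]

# Toy input (invented values) for "[CLS] modern computing [SEP]",
# assuming "computing" was split into two subword pieces.
hidden = np.array([
    [0.0, 0.0],   # [CLS]
    [1.0, 1.0],   # modern
    [2.0, 0.0],   # first piece of "computing"
    [4.0, 2.0],   # second piece of "computing"
    [0.0, 0.0],   # [SEP]
])
word_ids = [None, 0, 1, 1, None]

vectors = word_vectors(hidden, word_ids)
print(vectors[1])  # averaged vector for "computing"
```

Averaging is used here because it keeps every word vector in the same dimensionality regardless of how many pieces the tokenizer produced.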

Reranking as a Proxy

An alternative approach is to use a cross-encoder reranking model. A reranker takes a pair of sequences as input, typically a query and a candidate passage, and outputs a single relevance score. If one input is a single word and the other is a context sentence from a given era, the reranker score measures how well that word fits the context of that era.

Formally, for a word \(w\) and a context sentence \(s_t\) drawn from the corpus of time period \(t \in \{1, 2, \dots, n\}\), we can compute:

\(\mathrm{score}(w, s_t) = \mathrm{reranker}(w, s_t)\)

where \(t\) indexes the time period. This score does not produce a fixed embedding for \(w\), but rather a function that maps contexts to a relevance value. Tracking this function across time periods reveals how the word's fit to its typical contexts changes.

This approach has the advantage of being model-agnostic: it works with any cross-encoder reranker. The disadvantage is that it produces a score, not a point in a shared vector space, making visualization and clustering more difficult.
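A minimal sketch of the per-period scoring loop is shown below. The scorer is passed in as a plain function so that the loop stays independent of any specific reranker. The `toy_overlap_score` function is a placeholder invented for this example, not a real reranker, and the sentences are made-up toy data. In practice `score_fn` would wrap a cross-encoder (for instance, one loaded through the sentence-transformers `CrossEncoder` class), which is an assumption and is not shown here.

```python
from statistics import mean
from typing import Callable, Dict, List

def score_by_period(
    word: str,
    sentences_by_period: Dict[str, List[str]],
    score_fn: Callable[[str, str], float],
) -> Dict[str, float]:
    """Mean score(w, s_t) over the sentences of each time period t."""
    return {
        period: mean([score_fn(word, s) for s in sentences])
        for period, sentences in sentences_by_period.items()
        if sentences  # skip periods with no sentences
    }

def toy_overlap_score(word: str, sentence: str) -> float:
    """Placeholder scorer: fraction of sentence tokens equal to the word."""
    tokens = sentence.lower().split()
    return tokens.count(word.lower()) / len(tokens) if tokens else 0.0

# Invented toy sentences, for illustration only.
sentences_by_period = {
    "1950s": ["the clerk will compute the totals", "compute the sum by hand"],
    "2010s": ["we rent compute in the cloud"],
}

scores = score_by_period("compute", sentences_by_period, toy_overlap_score)
print(scores)
```

Keeping the full list of per-sentence scores instead of only the mean would give the score distribution per period, at the cost of a less compact summary.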

The Semantic Trajectory Hypothesis

If either of these approaches works, the result should be a trajectory: a sequence of points in vector space (or a sequence of score distributions), indexed by time, that traces how a word's meaning has shifted.

This trajectory could reveal several patterns:

Gradual drift. Most words change slowly. A continuous trajectory would capture the direction and speed of semantic change.
Abrupt shifts. Cultural events, technological innovations, and social movements can cause sudden redefinitions. The trajectory should show breakpoints.
Bifurcation and polysemy. Some words develop multiple concurrent meanings over time. The occurrences from a given period would not concentrate around a single point but would split into several clusters, one per sense.
Convergence. Words that start out with distinct meanings may drift toward the same meaning. Word pairs that were distant in early periods should move closer together in later ones.
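The first two patterns above, gradual drift and abrupt shifts, can be sketched numerically as follows. The example assumes that one centroid vector per time period is already available, and uses invented two-dimensional values. The breakpoint rule (a step larger than a multiple of the median step) is a hypothetical heuristic chosen for illustration, not an established method.

```python
import numpy as np
from typing import List

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity between two non-zero vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def step_distances(centroids: np.ndarray) -> np.ndarray:
    """Cosine distance between each pair of consecutive period centroids."""
    return np.array([
        cosine_distance(centroids[i], centroids[i + 1])
        for i in range(len(centroids) - 1)
    ])

def abrupt_shifts(steps: np.ndarray, factor: float = 3.0) -> List[int]:
    """Indices of steps exceeding `factor` times the median step (assumed heuristic)."""
    threshold = factor * float(np.median(steps))
    return [i for i, d in enumerate(steps) if d > threshold]

# Invented toy trajectory: slow drift, one sharp turn, then slow drift again.
centroids = np.array([
    [1.00, 0.00],
    [0.99, 0.14],
    [0.97, 0.24],
    [0.00, 1.00],
    [-0.10, 0.99],
])

steps = step_distances(centroids)
shifts = abrupt_shifts(steps)
print(steps, shifts)
```

Here the only flagged step is the one between the third and fourth periods. Detecting bifurcation would need the individual occurrence vectors per period, not just the centroids, since a centroid hides how the occurrences are spread.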

Implementing this at scale would require a historical corpus with reliable date annotations, a single embedding model that is held fixed across all time periods so that the resulting vectors share one space, and a strategy for aggregating per-token representations across many contextual sentences.
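One possible aggregation strategy is sketched below: group the per-occurrence vectors of a word by decade, then summarize each decade by a centroid and a dispersion value. The decade bucketing, the function names, and the toy vectors are all assumptions made for this example; other bucket sizes or robust averages would fit the same structure.

```python
import numpy as np
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple

def decade_of(year: int) -> int:
    """Map a publication year to the start of its decade, e.g. 1957 -> 1950."""
    return (year // 10) * 10

def centroids_by_decade(
    occurrences: Iterable[Tuple[int, Sequence[float]]],
) -> Dict[int, Tuple[np.ndarray, float, int]]:
    """Summarize dated occurrence vectors as {decade: (centroid, dispersion, count)}.

    Dispersion is the mean Euclidean distance of occurrences to their centroid;
    a large value may hint at several coexisting senses.
    """
    buckets = defaultdict(list)
    for year, vec in occurrences:
        buckets[decade_of(year)].append(np.asarray(vec, dtype=float))

    summary = {}
    for decade in sorted(buckets):
        stack = np.stack(buckets[decade])            # (n_occurrences, dim)
        centroid = stack.mean(axis=0)
        dispersion = float(np.linalg.norm(stack - centroid, axis=1).mean())
        summary[decade] = (centroid, dispersion, len(stack))
    return summary

# Invented toy occurrences: (publication year, per-word vector).
occurrences = [
    (1953, [1.0, 0.0]),
    (1957, [3.0, 0.0]),
    (2014, [0.0, 2.0]),
]

summary = centroids_by_decade(occurrences)
for decade, (centroid, dispersion, n) in summary.items():
    print(decade, centroid, dispersion, n)
```

Keeping the occurrence count alongside each centroid makes it easy to discard decades with too few examples to be trustworthy.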

Data Sources

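The quality of the result depends heavily on the quality of the corpus. The ideal data source is a large, chronologically annotated text collection: something like the Google Books Ngram corpus, the Oxford English Dictionary historical quotations, or a curated archive of dated publications. Each sentence or passage should be reliably associated with a publication date. Sources that provide only short fragments are a weaker fit, since a contextual model needs enough surrounding text to condition on; the Google Books n-grams, for example, are at most five tokens long.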

For Chinese, this is harder. Historical Chinese text corpora with reliable chronological annotations are less common and more difficult to tokenize consistently across time periods due to the evolution of written conventions.

This remains an exploration at the idea stage. The technical approach is straightforward. The data collection is the bottleneck. But if executed well, the resulting trajectories would provide a new empirical window into how language changes, one based not on hand-labeled dictionaries or qualitative analysis but on quantitative patterns in high-dimensional vector representations.