AI & MACHINE LEARNING

Embeddings

Dense vector representations of discrete objects (words, sentences, documents, images) in a continuous, fixed-dimensional space. Formalized in NLP by word2vec (Mikolov et al., 2013); today the foundation of semantic search, RAG, and most practical AI applications that work with text.

Extended definition

Embeddings are dense vector representations of discrete objects (words, sentences, documents, images, audio, code) in a continuous, fixed-dimensional space, typically between 256 and 4096 dimensions. The underlying premise, which predates modern models, is the distributional hypothesis: meaning emerges from patterns of co-occurrence in context. Each token $w$ is mapped to a vector $\mathbf{v}_w \in \mathbb{R}^d$, and semantic similarity between tokens is captured by vector proximity, usually measured by cosine similarity. The contemporary formalization begins with word2vec (Mikolov et al., 2013), which trained embeddings by context prediction at scale. GloVe (Pennington et al., 2014) offered an alternative based on factorization of the co-occurrence matrix. Contextual embeddings, in which the vector depends on the sentence in which a word appears, emerged with ELMo and became standard with BERT (Devlin et al., 2018). Today, embeddings produced by models such as Sentence-BERT, OpenAI text-embedding-3, and Cohere embed v3 are infrastructure for nearly every practical AI application with text.
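
To make the notion of vector proximity concrete, the following is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors and the names v_cat, v_dog, and v_car are hypothetical values chosen by hand for illustration; real embeddings come from a trained model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors (assumption: hand-picked, not produced by any model).
v_cat = [0.9, 0.1, 0.0]
v_dog = [0.8, 0.2, 0.0]
v_car = [0.0, 0.1, 0.9]

sim_cat_dog = cosine_similarity(v_cat, v_dog)  # vectors point in similar directions
sim_cat_car = cosine_similarity(v_cat, v_car)  # vectors are nearly orthogonal

print(f"cat vs dog: {sim_cat_dog:.3f}")
print(f"cat vs car: {sim_cat_car:.3f}")
```

Cosine similarity depends only on the direction of the vectors, not their length, which is one reason it is a common default for comparing embeddings.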

When it applies

Embeddings are appropriate whenever the problem involves semantic similarity, retrieval by meaning rather than keyword, clustering of conceptually close items, or dense features for downstream models. Typical applications include semantic search over documents, RAG (retrieval-augmented generation) systems, zero-shot or few-shot classification, approximate record deduplication, scientific literature clustering, and topic detection.
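
As an illustration of retrieval by meaning, the sketch below ranks a small set of documents by cosine similarity to a query vector and returns the top results. It assumes the document and query embeddings have already been computed by some model; the document identifiers, the vectors, and the helper names normalize and top_k are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def top_k(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    q = normalize(query_vec)
    scored = []
    for doc_id, vec in doc_vecs.items():
        d = normalize(vec)
        score = sum(a * b for a, b in zip(q, d))
        scored.append((doc_id, score))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical precomputed embeddings (assumption: toy values, not model output).
doc_vecs = {
    "doc_cardiology": [0.9, 0.1, 0.1],
    "doc_oncology": [0.7, 0.6, 0.1],
    "doc_tax_law": [0.1, 0.1, 0.9],
}
query_vec = [0.8, 0.2, 0.0]

results = top_k(query_vec, doc_vecs, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

This is a brute-force linear scan, which is adequate for small collections. Larger collections generally rely on a dedicated vector index rather than comparing the query against every document.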

When it does not apply

Embeddings are not appropriate for tasks requiring exact match (searching by case number, product code, or unique identifier), where classical inverted-index search (for example, BM25 ranking in Lucene) is more precise and cheaper. They do not replace structured features in problems with strong signal in tabular variables: for risk classification with numerical data, gradient boosting typically outperforms approaches based on text embeddings. They do not work well with vocabulary outside the training distribution without adaptation: highly specialized terms or low-resource languages produce low-quality vectors.

Applications by field

Search and retrieval: foundation of semantic search over documents, Q&A engines, RAG over technical literature.

Bibliometrics and systematic review: clustering of papers, topic detection with BERTopic, citation deduplication.

Health: semantic search over biomedical literature, similarity between medical records for decision support.

Social sciences and digital humanities: discourse analysis over large corpora, conceptual mapping in text collections.

Common pitfalls

The first pitfall is assuming that high cosine similarity implies semantic equivalence: embeddings capture distributional co-occurrence, not full meaning, so antonyms that frequently appear in the same contexts may have very close vectors. The second is using a general-purpose model in a specialized domain without adaptation; general-purpose embeddings, whether from OpenAI or from multilingual models, tend to underperform on case law, clinical text, or specialized scientific literature. The third is ignoring bias: embeddings reflect associations in the training data, including documented racial, gender, and regional stereotypes. The fourth is trusting comparisons between models without a domain-specific benchmark, since performance varies substantially across tasks. The fifth is neglecting computational cost: dense embeddings require storage and indexing (FAISS, Qdrant, Weaviate), with non-trivial costs at scale. The sixth is mixing embeddings from different models in the same space: vectors produced by different models are not comparable.
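
Two of these pitfalls lend themselves to a quick sanity check in code. The sketch below estimates the raw storage for a flat collection of vectors and refuses to compare vectors that carry different model labels. The figures (one million vectors, 1536 dimensions, 4-byte floats) are illustrative assumptions, and the names raw_index_bytes, TaggedVector, assert_comparable, "model-a", and "model-b" are hypothetical.

```python
from dataclasses import dataclass


def raw_index_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw size of the vector data alone, excluding any index overhead.

    Assumes each component is stored as a fixed-width number
    (4 bytes corresponds to 32-bit floats).
    """
    return num_vectors * dim * bytes_per_value


@dataclass(frozen=True)
class TaggedVector:
    model: str     # label of the model that produced the vector (hypothetical)
    values: tuple  # the embedding components


def assert_comparable(a, b):
    """Raise if two vectors should not be compared with each other."""
    if a.model != b.model:
        raise ValueError(f"vectors come from different models: {a.model} vs {b.model}")
    if len(a.values) != len(b.values):
        raise ValueError("vectors have different dimensions")


# Illustrative assumption: 1,000,000 vectors of 1536 dimensions, 32-bit floats.
size_bytes = raw_index_bytes(1_000_000, 1536)
print(f"raw vector data: {size_bytes / 1e9:.3f} GB")

a = TaggedVector("model-a", (1.0, 0.0))
b = TaggedVector("model-b", (1.0, 0.0))
try:
    assert_comparable(a, b)
except ValueError as err:
    print(f"refused: {err}")
```

The estimate covers only the raw numbers; an index built on top of them adds its own overhead. Storing the model label next to each vector is one simple way to catch accidental mixing when a model is swapped or upgraded.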
