LDA vs. BERTopic in academic corpora

The choice of topic-modeling method for academic corpora was quietly reconfigured during the 2020s. LDA, the operational reference since the seminal paper by Blei, Ng, and Jordan (2003), began competing with a fundamentally different approach: BERTopic, proposed by Grootendorst (2022). Most manuscripts that use topic modeling do not state clearly why they chose one or the other, and that omission typically comes back as a reanalysis request in peer review.

The choice is not stylistic. LDA and BERTopic operate over different representations of text, make different assumptions about topic structure, and produce outputs that carry different meaning. Knowing that difference is a prerequisite for defending the choice in manuscripts where topic modeling is the central method.

How each operates

LDA is a probabilistic generative model. The underlying assumption is that each document is generated from a mixture of topics, and each topic is a probability distribution over the vocabulary. The model estimates these distributions by approximate posterior inference, typically collapsed Gibbs sampling (which draws samples from the posterior) or variational inference (which maximizes a lower bound on the likelihood of the observed data). The text representation is bag-of-words: the frequency of each term in each document, without order, without context, without semantics.
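The generative assumptions above can be made concrete with a toy collapsed Gibbs sampler. The sketch below is an illustration in pure Python, not a production implementation: the function names `bag_of_words` and `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the four-document corpus are all assumptions made for this example. A real analysis would normally rely on an established library.

```python
import random
from collections import Counter


def bag_of_words(docs):
    """Map each document to term counts; order and context are discarded."""
    return [Counter(doc.lower().split()) for doc in docs]


def lda_gibbs(docs, n_topics=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA. Returns (theta, phi, vocab)."""
    rng = random.Random(seed)
    tokens = [doc.lower().split() for doc in docs]
    vocab = sorted({w for doc in tokens for w in doc})
    n_vocab = len(vocab)
    word_id = {w: i for i, w in enumerate(vocab)}

    # Count tables: document-topic, topic-word, and per-topic totals.
    n_dk = [[0] * n_topics for _ in tokens]
    n_kw = [[0] * n_vocab for _ in range(n_topics)]
    n_k = [0] * n_topics

    # Random initial topic assignment for every token.
    z = []
    for d, doc in enumerate(tokens):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[w]] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(tokens):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = z[d][i]
                # Remove the token's current assignment from the counts.
                n_dk[d][k] -= 1
                n_kw[k][v] -= 1
                n_k[k] -= 1
                # Conditional p(z = t | all other assignments), unnormalized.
                weights = [
                    (n_dk[d][t] + alpha)
                    * (n_kw[t][v] + beta)
                    / (n_k[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][v] += 1
                n_k[k] += 1

    # Smoothed estimates: topic mixture per document, word distribution per topic.
    theta = [
        [(n_dk[d][k] + alpha) / (len(doc) + n_topics * alpha) for k in range(n_topics)]
        for d, doc in enumerate(tokens)
    ]
    phi = [
        [(n_kw[k][v] + beta) / (n_k[k] + n_vocab * beta) for v in range(n_vocab)]
        for k in range(n_topics)
    ]
    return theta, phi, vocab


# Hypothetical mini-corpus, used only to exercise the sketch.
docs = [
    "neural network image segmentation network",
    "convolutional network tumor image",
    "opioid prescription cardiac risk",
    "opioid dose cardiac arrest risk",
]
theta, phi, vocab = lda_gibbs(docs)
for doc, mixture in zip(docs, theta):
    print([round(p, 2) for p in mixture], doc)
```

Each row of `theta` is the topic mixture of one document and each row of `phi` is one topic's distribution over the vocabulary, which are the two quantities the paragraph above describes. The sampler only ever sees term identities and counts, never word order or meaning.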

BERTopic is a three-stage pipeline that operates over dense text representations: embedding, clustering, and topic representation. Documents are first converted into dense embeddings by a pretrained BERT-style model (typically Sentence-BERT). UMAP then reduces the dimensionality of those embeddings, and HDBSCAN clusters the reduced vectors. Finally, each cluster receives a textual representation through a class-based variation of TF-IDF (c-TF-IDF). The result is a set of clusters of semantically related documents, each accompanied by representative terms.
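The last step of that pipeline, c-TF-IDF, is simple enough to sketch without any external library. The example below assumes the cluster labels are already available (the embedding, UMAP, and HDBSCAN steps are skipped, and the labels are invented by hand) and uses the weighting described by Grootendorst (2022), `tf * log(1 + A / f)`, on raw term counts. The function names `c_tf_idf` and `top_terms` are hypothetical, and the BERTopic library's own implementation may differ in details such as normalization.

```python
import math
from collections import Counter


def c_tf_idf(docs, labels):
    """Class-based TF-IDF: score[c][t] = tf[t, c] * log(1 + A / f[t])."""
    # Treat all documents of one cluster as a single "class document".
    class_counts = {}
    for doc, label in zip(docs, labels):
        class_counts.setdefault(label, Counter()).update(doc.lower().split())

    # f[t]: frequency of term t across all classes.
    total = Counter()
    for counts in class_counts.values():
        total.update(counts)

    # A: average number of words per class.
    avg_words = sum(total.values()) / len(class_counts)

    return {
        label: {
            term: tf * math.log(1 + avg_words / total[term])
            for term, tf in counts.items()
        }
        for label, counts in class_counts.items()
    }


def top_terms(scores, n=3):
    """Highest-scoring terms per cluster; ties are broken alphabetically."""
    return {
        label: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]
        for label, s in scores.items()
    }


# Hypothetical documents with hand-assigned cluster labels.
docs = [
    "neural network image segmentation",
    "convolutional network tumor image",
    "opioid prescription cardiac risk",
    "opioid dose cardiac arrest",
]
labels = [0, 0, 1, 1]

scores = c_tf_idf(docs, labels)
print(top_terms(scores, n=2))
```

Terms that are frequent inside one cluster and rare elsewhere receive the highest scores, which is why they serve as the representative terms of that cluster.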

The philosophical difference is central. LDA models each document as a mixture of topics: a paper on “deep learning applied to medical imaging” might be 60% deep-learning topic, 30% medical-imaging topic, and 10% others. BERTopic, in its default configuration, assigns each document to a single cluster (or leaves it as an outlier when HDBSCAN finds no dense neighborhood for it). The same paper would sit in one specific cluster whose textual description captures “deep learning in medical imaging” as a unified topic.
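The contrast between a mixture and a single label can be illustrated with a few lines of Python. This is only a sketch of what information each output carries: the proportions are the hypothetical ones from the example above, and taking the largest proportion is a stand-in for a hard label, not the way BERTopic actually computes cluster membership.

```python
import math

# Hypothetical topic proportions for the paper discussed above.
paper = {"deep learning": 0.6, "medical imaging": 0.3, "other": 0.1}


def hard_assignment(mixture):
    """Collapse a topic mixture to its single largest topic."""
    return max(mixture, key=mixture.get)


def entropy(mixture):
    """Shannon entropy in nats: 0 for a one-topic document, larger when mixed."""
    return -sum(p * math.log(p) for p in mixture.values() if p > 0)


print(hard_assignment(paper))
print(round(entropy(paper), 3))
```

A single label discards the secondary 30% share, while the entropy of the mixture gives a rough sense of how much a document straddles topics.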

When each is the right choice

LDA remains superior in three specific scenarios. The first is when lexical coherence based on term co-occurrence is the main interpretive criterion. Because LDA operates directly on frequencies, metrics like Cv coherence based on word co-occurrence tend to favor LDA (Röder, Both, and Hinneburg 2015). The second is when the corpus is large, the vocabulary is stable, and there is analytical value in modeling a topic mixture per document. The third is when the number of topics is known a priori or easily estimable by criteria such as perplexity or coherence.

BERTopic is the more defensible choice in three complementary scenarios. The first is when the corpus is heterogeneous in vocabulary but semantically coherent: papers addressing the same phenomenon with different terminology are grouped together by semantic similarity, which LDA, seeing only term co-occurrence, handles poorly. The second is when documents are short (abstracts, titles, academic tweets), where LDA’s bag-of-words carries little information per document. The third is when interpretability of clusters via representative terms matters more than the explicit probability of each document belonging to each topic.

Figure (bar chart): Qualitative comparative profiles of LDA and BERTopic across three evaluative dimensions relevant to academic corpora: lexical coherence, semantic coherence, and robustness in short corpora. Based on comparisons documented in Ma and colleagues (2025) on 1,837 PubMed abstracts concerning opioid-related cardiovascular risks in women, and in Babalola, Ojokoh, and Boyinbode (2024) on news headlines. Lexical coherence based on co-occurrence tends to favor LDA; semantic coherence and clustering of short documents tend to favor BERTopic. The choice between them depends on the dimension relevant to the analytical objective.

The evaluation trench

Coherence is the metric most used to compare topic models. The problem is that several coherence measures exist, and they measure different things. Cv combines co-occurrence counts from a sliding window with normalized pointwise mutual information and a cosine similarity between word context vectors (Röder, Both, and Hinneburg 2015). UMass scores word pairs by the log conditional probability of their document-level co-occurrence. NPMI is normalized pointwise mutual information, averaged over pairs of top words. In documented comparisons, LDA frequently wins on Cv while BERTopic wins on UMass and NPMI on the same corpus. That split arises not because one model is better, but because the metrics evaluate different aspects of what constitutes “coherence.”
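To see why the metrics can disagree, it helps to compute two of them by hand. The sketch below implements UMass and NPMI over document-level co-occurrence on a tiny invented corpus. It is a simplification under stated assumptions: NPMI is often estimated from sliding windows rather than whole documents, established toolkits add smoothing that is omitted here, and the function names and example topics are hypothetical.

```python
import math
from itertools import combinations


def doc_freq(docs_tokens, *words):
    """Number of documents that contain all of the given words."""
    return sum(1 for doc in docs_tokens if all(w in doc for w in words))


def umass(top_words, docs_tokens):
    """UMass: sum over ordered pairs of log((D(w_m, w_l) + 1) / D(w_l)).

    Assumes every top word occurs in at least one document.
    """
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            joint = doc_freq(docs_tokens, top_words[m], top_words[l])
            score += math.log((joint + 1) / doc_freq(docs_tokens, top_words[l]))
    return score


def npmi(top_words, docs_tokens):
    """Mean normalized pointwise mutual information over word pairs."""
    n_docs = len(docs_tokens)
    values = []
    for a, b in combinations(top_words, 2):
        p_a = doc_freq(docs_tokens, a) / n_docs
        p_b = doc_freq(docs_tokens, b) / n_docs
        p_ab = doc_freq(docs_tokens, a, b) / n_docs
        if p_ab == 0:
            values.append(-1.0)  # convention for pairs that never co-occur
        elif p_ab == 1:
            values.append(0.0)  # undefined (0 / 0); set to 0 here as an assumption
        else:
            values.append(math.log(p_ab / (p_a * p_b)) / (-math.log(p_ab)))
    return sum(values) / len(values)


# Hypothetical corpus: each document reduced to its set of terms.
corpus = [
    {"opioid", "cardiac", "risk"},
    {"opioid", "cardiac", "women"},
    {"opioid", "dose"},
    {"image", "network"},
    {"image", "network", "tumor"},
]
focused_topic = ["opioid", "cardiac", "risk"]
mixed_topic = ["opioid", "image", "risk"]

for name, topic in [("focused", focused_topic), ("mixed", mixed_topic)]:
    print(name, round(umass(topic, corpus), 3), round(npmi(topic, corpus), 3))
```

Both scores rank the focused topic above the mixed one on this corpus, but they live on different scales and react differently to rare words, which is one reason a ranking of models on one metric need not carry over to another.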

The operational consequence for a manuscript is that comparing LDA and BERTopic via a single metric is insufficient. A serious comparison reports multiple coherence measures, adds human evaluation on a subsample (annotation by domain experts of the relevance and interpretability of the extracted topics), and justifies which criterion prevails for the analytical objective.

The rule that reduces the choice to practice

The operational rule that works for typical academic corpora is: if the goal is to describe the thematic distribution of an extensive full-text corpus with stable disciplinary vocabulary, start with LDA and evaluate Cv. If the goal is to cluster short abstracts coming from heterogeneous databases, or to explore the semantic structure of an interdisciplinary corpus, start with BERTopic and evaluate via UMass plus human inspection. In manuscripts that justify topic modeling as a central method, running both and reporting both outputs is the path that sustains methodological discussion in peer review.

This analysis reflects Aria's practice in NLP and Text Mining and Complete Data Science Pipeline.
