Bibliometric analysis as empirical thesis argument

A thesis introduction often opens with “gaps remain in the literature on X.” Qualifying committees single out that sentence for a precise reason: the author rarely supports it with empirical evidence. The demonstration that a gap exists usually rests on subjective reading: the author read fifty papers and did not see topic X treated the way they propose. For a panel of specialists, that argument is fragile. They may have read fifty other papers and seen topic X covered extensively.

Bibliometric analysis addresses this problem with method. Instead of asserting the existence of a gap, bibliometrics demonstrates it empirically: how many articles were published on topic X in each temporal window, which authors and institutions lead production, which thematic clusters emerge in the co-citation network, and what is absent from those dimensions — which is, precisely, the gap.

The genealogy that matters

Bibliometrics was formalized by Alan Pritchard in 1969, in the Journal of Documentation, as “the application of mathematics and statistical methods to books and other media of communication.” The definition seems prosaic until one considers what preceded it. Lotka (1926), in a seminal paper in the Journal of the Washington Academy of Sciences, described the inverse-square distribution of scientific productivity: the proportion of authors producing n articles is approximately 1/n² of the proportion producing just one. Bradford (1934) described the law of literature scatter: a small core of journals concentrates most articles relevant to a topic. Zipf (1949) described the inverse relation between rank and frequency in word corpora. Garfield (1955), in Science, introduced the citation index that would make bibliometric analysis operational at scale.
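Lotka's inverse-square relation can be written compactly. The following is a worked sketch under idealized assumptions: the exponent is exactly 2 and there is no upper bound on productivity. Empirical corpora usually deviate from both. The symbol f(n), the fraction of authors with exactly n articles, is notation introduced here for illustration.

```latex
f(n) = \frac{f(1)}{n^{2}}, \qquad
\sum_{n=1}^{\infty} f(n) = f(1)\,\frac{\pi^{2}}{6} = 1
\;\Longrightarrow\;
f(1) = \frac{6}{\pi^{2}} \approx 0.608
```

Under those assumptions, roughly 61% of authors publish a single article, about 15% publish two, and about 7% publish three.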

These anchors are not historical curiosities. They define what can be empirically demonstrated about the structure of any scientific corpus. Bibliometric analysis conducted without reference to this genealogy is often a descriptive exercise: lists of most-cited papers and keyword counts. Analysis conducted with the genealogy is a structural argument.

Figure (bar chart): Typical distribution of author productivity in a scientific corpus, following Lotka's law (1926); about 60% of authors contribute only one article. The critical reading is the contrast between the long decay and the concentration at the right extreme. The highlighted category of five or more publications contains the authors whose systematic production defines the structure of the field; these are the canonical readings a serious manuscript cannot ignore. The pattern is consistent with Souza, Kuniyoshi, and Freitas (2024) in an ESG corpus (1,574 authors, 699 articles) and with Hoang (2025) in a methodological review.

What serious bibliometrics delivers

Serious bibliometrics is not counting. It is structural mapping with three specific deliverables. The first is the temporal production curve: how many articles per year on the topic, indicating whether the field is young, mature, or declining. A young field with few papers per year admits original contribution from more angles; a mature field with high volume demands more precise positioning.
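The temporal production curve reduces to a count of articles per publication year. A minimal sketch, assuming each record is a dictionary with a `year` field; the function name `production_curve` and the toy records are hypothetical and only illustrate the shape of the computation.

```python
from collections import Counter


def production_curve(records, start_year, end_year):
    """Count articles per year in the window, keeping empty years as zero."""
    counts = Counter(
        r["year"] for r in records if start_year <= r["year"] <= end_year
    )
    return {year: counts.get(year, 0) for year in range(start_year, end_year + 1)}


# Hypothetical toy corpus, not real data
records = [
    {"title": "A", "year": 2019},
    {"title": "B", "year": 2021},
    {"title": "C", "year": 2021},
    {"title": "D", "year": 2022},
]

curve = production_curve(records, 2019, 2022)
print(curve)  # {2019: 1, 2020: 0, 2021: 2, 2022: 1}
```

Keeping zero-count years explicit matters for the plot: a gap in the series would otherwise hide a year in which nothing was published.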

The second deliverable is the co-citation network. When two papers are cited together by a third, an edge exists between them in the network. Dense clusters in the network correspond to consolidated intellectual traditions in the field. Identifying clusters allows the author to position their contribution against a specific tradition rather than against the literature as a whole. That positioning is less ambitious and far more defensible.
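The edge rule described above can be sketched in a few lines. This is an illustration, not a full network analysis: it assumes each citing paper is represented by the list of works it cites, it uses hypothetical identifiers, and it stops at weighted edges without attempting cluster detection.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists):
    """Map each unordered pair of cited works to the number of papers citing both."""
    pairs = Counter()
    for refs in reference_lists:
        # Deduplicate and sort so every pair has one canonical orientation
        for a, b in combinations(sorted(set(refs)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical reference lists of three citing papers
citing_papers = [
    ["ref_a", "ref_b", "ref_c"],
    ["ref_a", "ref_b"],
    ["ref_c", "ref_d"],
]

edges = cocitation_counts(citing_papers)

# Keep only pairs co-cited at least twice (threshold chosen for illustration)
strong_edges = {pair: w for pair, w in edges.items() if w >= 2}
print(strong_edges)  # {('ref_a', 'ref_b'): 2}
```

The resulting weighted edge list is the input a graph library or a dedicated bibliometric tool would take for clustering and layout.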

The third deliverable is the identification of dominant authors and institutions via Lotka’s law applied to the corpus. The Lotka pyramid in a mature field has two or three authors accounting for a disproportionate fraction of production. Not citing those authors is an obvious vulnerability in peer review. Citing them requires knowing their work, which in turn shapes the candidate’s subsequent reading.
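Applying Lotka's law to a corpus starts with the observed productivity distribution. A minimal sketch, assuming full counting (every co-author is credited with one article) and author names that are already disambiguated; the function names and the toy corpus are hypothetical.

```python
from collections import Counter


def articles_per_author(author_lists):
    """Count articles per author, crediting each co-author once per article."""
    return Counter(a for authors in author_lists for a in set(authors))


def productivity_distribution(author_lists):
    """Map n to the share of authors with exactly n articles in the corpus."""
    per_author = articles_per_author(author_lists)
    authors_per_n = Counter(per_author.values())
    total = len(per_author)
    return {n: authors_per_n[n] / total for n in sorted(authors_per_n)}


def dominant_authors(author_lists, min_articles):
    """List authors with at least min_articles, most productive first."""
    per_author = articles_per_author(author_lists)
    selected = [a for a, c in per_author.items() if c >= min_articles]
    return sorted(selected, key=lambda a: (-per_author[a], a))


# Hypothetical toy corpus: one author list per article
corpus = [
    ["Author A", "Author B"],
    ["Author A"],
    ["Author A", "Author C"],
    ["Author D"],
    ["Author B"],
]

print(productivity_distribution(corpus))  # {1: 0.5, 2: 0.25, 3: 0.25}
print(dominant_authors(corpus, 2))        # ['Author A', 'Author B']
```

Comparing the observed shares with the inverse-square expectation shows whether the corpus is more or less concentrated than Lotka's idealized case; name disambiguation, skipped here, is usually the hard part in practice.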

Delivery in a real manuscript

In an empirical manuscript, bibliometrics does not replace theoretical grounding. It enters as a short section preceding it, typically two to four paragraphs and one or two figures. The first figure is the temporal production curve; the second, when justified, is the co-citation network map with identified clusters. The text accompanying the figures does three things: identifies the clusters, positions the research in one of them, and empirically justifies why the specific gap the study addresses exists in that cluster and not in another.

Corpus construction requires methodological decisions that must be declared: database used (Scopus, Web of Science, or both with duplicate treatment), exact search strings, filters applied (document type, language, temporal window), and final number of articles analyzed. Without these declarations, bibliometrics loses traceability and the empirical argument weakens.
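One way to keep those declarations traceable is to store them next to the corpus itself. The sketch below is an assumption-laden illustration: the `CorpusProtocol` fields, the placeholder search string, and the example DOIs are all hypothetical, and duplicates are matched on normalized DOI only, which presumes every record has one.

```python
from dataclasses import dataclass


@dataclass
class CorpusProtocol:
    """Declared corpus decisions (field names are illustrative)."""
    databases: list
    search_string: str
    document_types: list
    languages: list
    year_range: tuple
    final_count: int = 0


def merge_without_duplicates(*exports):
    """Merge record lists, keeping the first record seen for each DOI."""
    seen, merged = set(), []
    for export in exports:
        for record in export:
            key = record["doi"].strip().lower()  # normalize before comparing
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged


# Hypothetical exports with placeholder DOIs
scopus = [
    {"doi": "10.0000/example.1", "source": "scopus"},
    {"doi": "10.0000/example.2", "source": "scopus"},
]
wos = [
    {"doi": "10.0000/EXAMPLE.2 ", "source": "wos"},  # same article as above
    {"doi": "10.0000/example.3", "source": "wos"},
]

merged = merge_without_duplicates(scopus, wos)

protocol = CorpusProtocol(
    databases=["Scopus", "Web of Science"],
    search_string='"topic x"',  # placeholder, not a real query
    document_types=["article"],
    languages=["English"],
    year_range=(2000, 2024),
    final_count=len(merged),
)
print(protocol.final_count)  # 3
```

Real exports often contain records without a DOI, so a production pipeline would likely need a fallback such as matching on normalized title and year.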

This analysis reflects Aria's practice in Bibliometric Analysis and Revision and Rewriting.