Embeddings and Cultural Bias: What Pretrained Models Learn and Forget

An embedding is a compressed imprint of the text that trained it: it learns the culture of that corpus, with its stereotypes and its silences. Pretrained does not mean neutral. For under-represented populations there are two failures: the encoded stereotype and the thin representation. And the bias is measurable: on health benchmarks covering ethnicity, a biomedical model encoded stronger associations than a legal one.

An embedding is a compressed imprint of the text that trained it. By representing each word as a vector whose position summarizes which other words it tends to appear with, the model learns the culture of the corpus: its useful associations, its stereotypes, and its silences too. That is why the word pretrained should not be read as a synonym for neutral. A model that arrives ready arrives loaded with the regularities of the text it came from, and those regularities include how a society speaks, or fails to speak, about its under-represented groups. A reviewer who sees a study resting on embeddings asks, before the results, what that model learned about the populations in question.
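
To make the idea of a vector that summarizes a word's company concrete, here is a minimal sketch in Python. It builds raw co-occurrence count vectors from a toy corpus and compares them by cosine similarity. The corpus, the window size, and the function names are invented for illustration; real pretrained embeddings are learned by neural models on far larger text, not counted this way.

```python
from collections import Counter
from math import sqrt

# Toy corpus, invented for illustration only.
corpus = [
    "the nurse helped the patient",
    "the nurse comforted the patient",
    "the engineer fixed the machine",
    "the engineer repaired the machine",
]

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(context)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = cooccurrence_vectors(corpus)
sim_nurse_patient = cosine(vectors["nurse"], vectors["patient"])
sim_nurse_machine = cosine(vectors["nurse"], vectors["machine"])
print(sim_nurse_patient > sim_nurse_machine)
```

In this toy corpus "nurse" ends up closer to "patient" than to "machine" only because of the sentences it was given, which is the sense in which a vector inherits the regularities of its text.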

The first thing to establish is that this bias is not an impression; it is a quantity. Caliskan and colleagues (2017) [2] introduced the Word-Embedding Association Test (WEAT) and showed that off-the-shelf models reproduce documented human biases, from benign associations to racial and gender stereotypes, with effect sizes measured on the same scale as a psychological test. Garg and colleagues (2018) [3] took the measurement through time, training embeddings on a century of text and showing that the associations for women and ethnic minorities track real demographic and occupational change. Charlesworth and colleagues (2021) [4] found the same gender stereotypes in corpora as different as children’s speech and adult media, evidence that the learned bias is a stable property of the language, not an accident of one dataset.
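
The association test described above (commonly abbreviated WEAT) can be sketched in a few lines. This is a minimal illustration of the effect size as it is usually stated for the test, not the code of any of the cited studies. The two-dimensional vectors are made up, and the use of the sample standard deviation is an assumption: implementations differ on sample versus population deviation.

```python
from math import sqrt
from statistics import mean, stdev

def cosine(u, v):
    """Cosine similarity between two dense vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def association(w, A, B):
    """How much closer word vector w sits to attribute set A than to attribute set B."""
    return mean(cosine(w, a) for a in A) - mean(cosine(w, b) for b in B)

def weat_effect_size(X, Y, A, B):
    """Difference in mean association between target sets X and Y,
    scaled by the spread of associations over all target words.
    Assumption: sample standard deviation (statistics.stdev)."""
    s_x = [association(x, A, B) for x in X]
    s_y = [association(y, A, B) for y in Y]
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)

# Hypothetical 2-D vectors, invented for illustration only.
X = [(1.0, 0.1), (0.9, 0.2)]   # target set 1
Y = [(0.1, 1.0), (0.2, 0.9)]   # target set 2
A = [(1.0, 0.0), (0.8, 0.1)]   # attribute set 1
B = [(0.0, 1.0), (0.1, 0.8)]   # attribute set 2

d = weat_effect_size(X, Y, A, B)
d_swapped = weat_effect_size(Y, X, A, B)
print(d > 0, abs(d + d_swapped) < 1e-9)
```

A positive value means the first target set leans toward the first attribute set; swapping the target sets flips the sign, which is a quick sanity check on any implementation.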

For under-represented populations the problem has two faces. One is the stereotype the model actively encodes; the other is the thin representation, what it simply never learned because the corpus barely spoke of that group. The two failures compound: where there is little text, the representation is unstable and easily dominated by the majority stereotype. Durrheim and colleagues (2023) [5] review how embeddings yield valid and reliable estimates of bias along bipolar dimensions, detecting subtle prejudices that are not stated openly, and it is that validity which lets us treat the problem as measurable rather than rhetorical. What the model fails to encode about an under-represented group is as consequential as what it does encode.
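
The instability that comes with little text can be illustrated with a deliberately simple simulation. It is a statistical analogy, not a model of how embeddings are trained: each mention of a group is treated as one noisy observation of a hypothetical true association, and the estimate is their average. The true value, the noise level, and the function names are all assumptions chosen for illustration.

```python
import random
from statistics import pvariance

def estimated_association(n_mentions, rng, true_association=0.2, noise=1.0):
    """Average of n noisy observations of a hypothetical true association."""
    draws = (rng.gauss(true_association, noise) for _ in range(n_mentions))
    return sum(draws) / n_mentions

def spread_across_corpora(n_mentions, n_corpora=200, seed=0):
    """Variance of the estimate across many simulated corpora of the same size."""
    rng = random.Random(seed)
    estimates = [estimated_association(n_mentions, rng) for _ in range(n_corpora)]
    return pvariance(estimates)

rare = spread_across_corpora(n_mentions=5)      # group mentioned 5 times
common = spread_across_corpora(n_mentions=500)  # group mentioned 500 times
print(rare > common)
```

With few mentions the estimate swings widely from one simulated corpus to the next, so whatever signal dominates the surrounding text has more room to set the value.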

How large the measured bias is, and where it concentrates, has been quantified recently. Gray and Wu (2025) [1] measured SD-WEAT scores, a variant of the association test, for pretrained models on health benchmarks tied to sensitive populations.

[Figure: grouped bar chart of SD-WEAT bias scores on two ethnicity benchmarks, by model; higher means more bias. BioBERT scores 0.844 and 0.868; LegalBERT scores 0.348 and 0.663.]
SD-WEAT bias score on two ethnicity benchmarks, by model, from the Gray and Wu (2025) measurement. The biomedicine-specialized model (BioBERT) encodes stronger associations than the legal model (LegalBERT) on both benchmarks covering under-represented populations.

The reading undoes a common assumption. A model specialized in a technical domain is expected to be cleaner, more focused, less contaminated by social stereotype. The data show the opposite: on the ethnicity benchmarks, BioBERT, trained on biomedical text, scored 0.844 and 0.868, above LegalBERT at 0.348 and 0.663. Specialization does not dilute cultural bias; when the domain corpus carries the same asymmetries as the society that produced it, specialization can concentrate the bias precisely on the groups that appear least, and the cleaner the domain looks, the more easily that concentration goes unnoticed. Choosing a pretrained model for its domain does not excuse the researcher from measuring what that model learned about the people the study will touch.
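
The size of the gap can be read directly from the four reported scores. The short sketch below only does that arithmetic. The benchmark labels are placeholders, since the text does not name the two benchmarks, and pairing the scores in the order they are listed is an assumption.

```python
# Scores as reported in the text above (Gray and Wu, 2025).
# Benchmark labels are placeholders; the pairing follows the order listed.
reported = {
    "BioBERT": {"ethnicity_benchmark_1": 0.844, "ethnicity_benchmark_2": 0.868},
    "LegalBERT": {"ethnicity_benchmark_1": 0.348, "ethnicity_benchmark_2": 0.663},
}

# Absolute gap and ratio between the two models, per benchmark.
gaps = {
    bench: reported["BioBERT"][bench] - reported["LegalBERT"][bench]
    for bench in reported["BioBERT"]
}
ratios = {
    bench: reported["BioBERT"][bench] / reported["LegalBERT"][bench]
    for bench in reported["BioBERT"]
}

for bench in gaps:
    print(f"{bench}: gap = {gaps[bench]:.3f}, ratio = {ratios[bench]:.2f}")
```

Under that pairing the gaps are 0.496 and 0.205, so the difference between the two models is itself uneven across benchmarks, one more reason to report each benchmark separately.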

What makes this bias dangerous is not its presence in the vector but its propagation. An embedding is rarely the final product; it feeds classifiers, retrieval systems, triage models, and text generators, and each of them inherits the association the vector carried. When a clinical triage model uses representations that tie certain groups to certain conditions more strongly than the evidence warrants, the bias stops being a research curiosity and starts shaping decisions. Garg and colleagues (2018) [3] already showed that the association in the embedding mirrors the social structure of the corpus; the trouble is that the downstream system treats that association as knowledge rather than as the historical residue of a text. That is why measuring the bias at the source, before it dissolves into layers of downstream models, is the only intervention that can still see where it came from.

The practical consequence is not to abandon embeddings but to audit them before trusting them. Measure the bias per group with an association test, rather than assuming neutrality from the model’s technical origin. Examine the provenance of the training corpus, because under-representation in the text becomes under-representation in the vector. Evaluate each population relevant to the study separately, since an aggregate score hides the group the model represents worst. When bias mitigation is applied, verify that it reduced the measured association rather than merely masking it, because shallow debiasing tends to displace bias without removing it. And report that audit in the methods section, with the per-group scores, the way any other property of an instrument is reported. An embedding is a measurement instrument that carries the culture of whoever wrote it; using it without measuring that load hands the reader a result whose most sensitive part went uninspected.
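
The audit steps above can be sketched as a small report. This is an illustrative outline under stated assumptions, not a standard tool: the group names, the scores, and the flagging threshold are all hypothetical, and the scores are assumed to come from whatever association test the study has chosen.

```python
from statistics import mean

def audit(scores_by_group, threshold):
    """Summarize per-group bias scores next to the aggregate that would hide them."""
    worst_group = max(scores_by_group, key=lambda g: abs(scores_by_group[g]))
    return {
        "aggregate": mean(abs(s) for s in scores_by_group.values()),
        "worst_group": worst_group,
        "worst_score": scores_by_group[worst_group],
        "flagged": sorted(g for g, s in scores_by_group.items() if abs(s) > threshold),
    }

def mitigation_report(before, after):
    """Per-group change in absolute score; positive means the measured association shrank."""
    return {g: abs(before[g]) - abs(after[g]) for g in before}

# Hypothetical scores, invented for illustration only.
before = {"group_a": 0.10, "group_b": 0.15, "group_c": 0.95}
after = {"group_a": 0.05, "group_b": 0.30, "group_c": 0.60}

report = audit(before, threshold=0.5)  # threshold is an arbitrary example value
change = mitigation_report(before, after)
print(report)
print(change)
```

In this made-up case the aggregate sits below the threshold while one group is far above it, and the mitigation lowers the worst score while raising another group's, which is the displacement the audit is meant to catch.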

References

  1. Gray, M.; Wu, L. (2025). Benchmarking bias in embeddings of healthcare AI models: using SD-WEAT for detection and measurement across sensitive populations. https://doi.org/10.1186/s12911-025-03102-8
  2. Caliskan, A.; Bryson, J. J.; Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. https://doi.org/10.1126/science.aal4230
  3. Garg, N.; Schiebinger, L.; Jurafsky, D.; Zou, J. (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. https://doi.org/10.1073/pnas.1720347115
  4. Charlesworth, T. E. S.; Yang, V.; Mann, T. C.; Kurdi, B.; Banaji, M. R. (2021). Gender Stereotypes in Natural Language: Word Embeddings Show Robust Consistency Across Child and Adult Language Corpora of More Than 65 Million Words. https://doi.org/10.1177/0956797620963619
  5. Durrheim, K.; Schuld, M.; Mafunda, M.; Mazibuko, S. (2023). Using word embeddings to investigate cultural biases. https://doi.org/10.1111/bjso.12560

This analysis reflects Aria's practice in NLP and Text Mining and Computer Vision.
