Glossary · 82 entries
Vocabulary.
Of contemporary research.
Bilingual technical glossary of terms central to contemporary academic research, focused on writing, data, statistics, and computational methods. Each entry includes application context, usage limits, and frequent pitfalls.
Acknowledgments Manuscript section recognizing contributions not sufficient for authorship under ICMJE: funding, infrastructure, technical support, critical review, research service provision. Standard form to declare substantive contributions that do not meet the four authorship criteria.
Writing Algorithmic fairness ML subfield studying bias and discrimination in algorithmic systems, with formal criteria (demographic parity, equal opportunity, calibration) that are often in mathematical tension with one another. Barocas, Hardt, and Narayanan (2019) wrote the standard reference text.
AI/ML Analysis of variance (ANOVA) Classical statistical technique for comparing means across three or more groups. Established by Fisher in 1925, it forms the foundation of experimental design in biomedical, agricultural, and behavioral sciences.
Statistics Article Processing Charge (APC) Fee charged by gold or hybrid OA journals to process and publish an accepted article. Typically ranges from US$ 500 to US$ 12,000 depending on journal prestige. Can be paid by author, institution, funding agency, or via waiver.
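Writing AUC-ROC Area Under the Receiver Operating Characteristic curve: discrimination metric for binary classifiers that integrates performance across all decision thresholds. Hanley and McNeil (1982) formalized the probabilistic interpretation: the AUC equals the probability that a randomly chosen positive case is scored above a randomly chosen negative one. Ranges from 0 to 1, where 0.5 corresponds to random ranking and 1.0 to perfect discrimination; values below 0.5 indicate a ranking worse than chance.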
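The probabilistic reading of AUC can be checked directly on a small example. The sketch below is an illustration only: the function name, labels, and scores are hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def auc_pairwise(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = positive class, 0 = negative class.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = auc_pairwise(labels, scores)                          # 8 of 9 pairs ordered correctly
auc_inverted = auc_pairwise(labels, [-s for s in scores])   # same scores, ranking reversed
```

Reversing the ranking yields an AUC below 0.5, which is why a value under 0.5 signals an inverted classifier rather than an impossible one.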
AI/ML BERT Pre-trained language model based on the Transformer architecture, developed by Google in 2018. Trained by *masked language modeling*, BERT established the pre-training + fine-tuning paradigm that dominated natural language processing until the generative LLM era.
AI/ML BERTopic Modern topic modeling algorithm combining contextual embeddings (BERT, Sentence-Transformers), dimensionality reduction (UMAP), clustering (HDBSCAN), and c-TF-IDF. Introduced by Grootendorst (2022). Often surpasses LDA in semantic coherence on small and medium corpora.
AI/ML Bibliometric analysis Quantitative mapping of a field's scientific output through article metadata: coauthorship networks, co-citation, temporal evolution, emerging fronts. Today relies on Scopus, Web of Science, and tools like VOSviewer and Bibliometrix.
Statistics Bootstrap Family of resampling-with-replacement methods that estimates the sampling distribution of an estimator from a single sample. Proposed by Efron (1979). Enables CIs and hypothesis tests without parametric normality assumptions.
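A minimal sketch of a percentile bootstrap confidence interval, assuming a small hypothetical sample and the mean as the estimator. The function name, defaults, and data are illustrative; other bootstrap intervals (BCa, studentized) exist and are not shown.

```python
import random
import statistics


def bootstrap_ci(sample, stat=statistics.mean, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap CI: resample with replacement, recompute, take quantiles."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    estimates = sorted(
        stat([sample[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    alpha = 1.0 - level
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical measurements.
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 3.1]
lo, hi = bootstrap_ci(data)
```

Passing `stat=statistics.median` gives an interval for the median with no change to the resampling logic, which is the practical appeal of the method.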
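Statistics CiteScore Bibliometric metric launched by Elsevier in December 2016, based on Scopus data. Since the 2020 methodology revision, it divides the citations received over a four-year window by the number of peer-reviewed documents published in that same window (the original formula used one citation year and the three preceding publication years). Open, free, and covers more journals than JIF.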
Writing Classification metrics Family of metrics for evaluating supervised classification models: accuracy, precision, recall, F1-score, AUC-ROC. Each captures a different aspect of the trade-off between false positives and false negatives. Powers (2011) synthesized the canonical framework.
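As a small worked example of how precision, recall, and F1 weigh false positives and false negatives differently, the sketch below computes them from raw predictions. The function name and the labels are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here, not a standard.

```python
def precision_recall_f1(y_true, y_pred):
    """Binary metrics with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # penalizes false positives
    recall = tp / (tp + fn) if tp + fn else 0.0     # penalizes false negatives
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean
    return precision, recall, f1


# Hypothetical predictions: 2 true positives, 1 false positive, 2 false negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```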
AI/ML CLIP (Contrastive Language-Image Pre-training) Multimodal model pretrained by OpenAI that learns aligned image-text representations via contrastive learning on 400M web image-caption pairs (Radford et al., 2021). Enables zero-shot classification and image search, and serves as a foundation for visual generative models.
AI/ML Cluster analysis Family of unsupervised methods that groups observations by similarity. Classical algorithms: k-means (MacQueen, 1967), hierarchical clustering, DBSCAN. Validation via silhouette (Rousseeuw, 1987), stability, and interpretability.
Statistics Confidence interval Range of values constructed from sample data which, under repeated use, contains the true population parameter with probability equal to the nominal confidence level (typically 95%). Formalized by Neyman in 1937.
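The "repeated use" reading of a confidence interval can be illustrated by simulation. The sketch below assumes normally distributed data with a known standard deviation, so a z-interval applies; all parameter values are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def coverage(true_mean=10.0, sigma=2.0, n=30, reps=2000, level=0.95, seed=1):
    """Fraction of simulated z-intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% level
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        m = fmean(sample)
        if m - half_width <= true_mean <= m + half_width:
            hits += 1
    return hits / reps


observed_coverage = coverage()
```

The observed coverage should land near the nominal 95%. The probability statement concerns the procedure across many samples, not any single computed interval.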
Statistics Confirmatory factor analysis (CFA) Modeling technique that tests whether a hypothesized *a priori* factor structure fits observed data. Psychometric standard for validating measurement instruments with scales and items; established by Jöreskog in 1969 and implemented today in lavaan, Mplus, and AMOS.
Statistics Conflict of interest Situation where secondary interests (financial, personal, professional) may unduly influence judgment about the primary interest (research rigor). Declaration is mandatory in manuscripts via the ICMJE form. Disclosure does not eliminate the conflict; transparency is the safeguard.
Writing Convergent and discriminant validity Instrument validity criteria: convergent (items of the same construct correlate strongly) and discriminant (items of distinct constructs correlate weakly). Classical operationalization via AVE by Fornell and Larcker (1981) and HTMT by Henseler et al. (2015).
Statistics COPE (Committee on Publication Ethics) International nonprofit founded in 1997 that sets editorial ethics standards. Maintains Core Practices, Code of Conduct, and flowcharts for misconduct. Over 13,000 member journals and publishers. Operational reference in publication integrity.
Writing Cover letter Short document accompanying manuscript submission to a journal, addressed to the editor, articulating work relevance, fit with journal scope, and editorial declarations (originality, no parallel submission). Influences initial editorial triage.
Writing CRediT taxonomy Contributor Roles Taxonomy: international standard of 14 contribution categories in academic manuscripts, maintained by CASRAI/NISO. Replaces the generic notion of authorship with explicit role declaration. Adopted by thousands of journals across major publishers.
Writing Cronbach's alpha Classical coefficient of internal consistency for scales and instruments, proposed by Cronbach in 1951. Despite massive use in psychometrics, it is now widely criticized for restrictive assumptions (notably tau-equivalence and uncorrelated errors); alternatives such as McDonald's omega are often preferred.
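For reference, the coefficient is usually written as $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i \sigma_i^2}{\sigma_{\text{total}}^2}\right)$ for $k$ items. A minimal sketch of that formula follows; the function name and response matrix are hypothetical.

```python
import statistics


def cronbach_alpha(responses):
    """responses: one row per respondent, one column per item."""
    k = len(responses[0])
    item_columns = list(zip(*responses))
    item_variances = [statistics.variance(col) for col in item_columns]
    total_variance = statistics.variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical 3-item scale where every item gives the same score per respondent.
identical_items = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
alpha_identical = cronbach_alpha(identical_items)
```

Perfectly redundant items give alpha equal to 1, which illustrates why a very high alpha can indicate item redundancy rather than a better scale.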
Statistics Cross-validation Predictive model evaluation technique that partitions the dataset into k subsets, trains k times alternating which subset serves as validation, and reports the mean error. Standard for small datasets where a fixed train/test split is unstable.
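A minimal sketch of the k-fold partitioning step, using indices only. The function name is hypothetical, and the folds are contiguous without shuffling, which assumes the data order carries no structure; in practice the data are usually shuffled or stratified first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(10, 3))
```

Each observation appears in exactly one validation fold; the reported score is the mean of the k validation errors.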
AI/ML Effect size Quantitative measure of the magnitude of an observed effect or difference, independent of sample size. Includes the d (Cohen), r (correlation), and odds ratio families. A reporting component required by modern standards (DORA, ASA, APA, AMA).
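As an example from the d family, Cohen's d for two independent groups divides the mean difference by the pooled standard deviation. The sketch below uses hypothetical data and a hypothetical function name.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_variance = (
        (na - 1) * statistics.variance(group_a)
        + (nb - 1) * statistics.variance(group_b)
    ) / (na + nb - 2)
    mean_difference = statistics.mean(group_a) - statistics.mean(group_b)
    return mean_difference / math.sqrt(pooled_variance)


# Hypothetical scores: means differ by 1, pooled SD is 2.
d = cohens_d([2, 4, 6], [1, 3, 5])
```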
Statistics Embeddings Dense vector representations of units such as words, sentences, documents, or images in a continuous fixed-dimensional space. Popularized in NLP by word2vec (Mikolov et al., 2013); today the foundation of semantic search, RAG, and most practical AI applications with text.
AI/ML Errata and corrigenda Editorial instruments correcting errors in published articles while preserving findings. Erratum: publisher error (composition, figure, typography). Corrigendum: author error (calculation, attribution, data). Distinct from retraction, which marks the article's findings as no longer reliable.
Writing Exploratory factor analysis (EFA) Multivariate data reduction technique that identifies latent factors underlying a set of observed variables, without a priori hypothesis about structure. Typically precedes CFA in measurement instrument validation.
Statistics FAIR principles Set of four principles for research data management: Findable, Accessible, Interoperable, Reusable. Articulated by Wilkinson et al. (2016, Scientific Data). International standard adopted by the European Commission, NIH, and global funders.
Cross-cutting Feature engineering Set of practices for transforming raw data into informative features for ML models: encoding, normalization, derived feature creation, selection, reduction. Domingos (2012) articulated it as a central variable of practical ML performance.
AI/ML Fine-tuning Adaptation of a pre-trained model to a specific task or domain via additional training over smaller labeled data. The dominant paradigm in NLP between 2018 and 2022, still relevant for BERT and specialized variants in technical domains.
AI/ML Fine-tuning vs prompt engineering Applied comparison between two paradigms for adapting LLMs: fine-tuning (weight update with specific data) and prompt engineering (instruction design without modifying the model). Trade-off among cost, control, latency, and generalization.
AI/ML H-index Bibliometric metric proposed by Jorge Hirsch in 2005 combining productivity and impact: a researcher's h-index is the largest number h such that h of their articles have at least h citations each. Widely used and widely contested in quantitative research evaluation.
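The definition translates into a short computation over a list of citation counts. This is a minimal sketch with a hypothetical function name and made-up counts.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


example_h = h_index([10, 8, 5, 4, 3])  # the fifth paper has only 3 citations
```

One highly cited paper barely moves the index: `[250, 8, 5, 3, 3]` and `[25, 8, 5, 3, 3]` give the same h, a frequently cited limitation.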
Writing Human annotation and inter-annotator agreement Manual labeling of data (text, image, audio) by human annotators, the basis of supervised datasets in ML. Inter-annotator agreement (IAA) is measured with chance-corrected coefficients such as Cohen's kappa (1960) or Krippendorff's alpha. Essential quality criterion.
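Cohen's kappa corrects raw agreement for the agreement expected by chance: $\kappa = (p_o - p_e) / (1 - p_e)$. A minimal sketch for two annotators follows; the function name and labels are hypothetical, and the degenerate case where $p_e = 1$ is not handled.

```python
from collections import Counter


def cohens_kappa(annotator_a, annotator_b):
    """Chance-corrected agreement between two annotators over the same items."""
    n = len(annotator_a)
    observed = sum(1 for a, b in zip(annotator_a, annotator_b) if a == b) / n
    counts_a, counts_b = Counter(annotator_a), Counter(annotator_b)
    labels = set(annotator_a) | set(annotator_b)
    # Expected agreement if both annotators labeled independently at their own rates.
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical labels: the annotators agree on 8 of 10 items.
ann_a = ["pos"] * 5 + ["neg"] * 5
ann_b = ["pos"] * 4 + ["neg"] * 5 + ["pos"]
kappa = cohens_kappa(ann_a, ann_b)
```

Here raw agreement is 0.80 but kappa is 0.60, because half of the agreement would be expected by chance with balanced labels.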
AI/ML ICMJE International Committee of Medical Journal Editors: body that defines editorial conventions for authorship, conflicts of interest, peer review, and integrity in scholarly production across biomedical and health sciences, with adoption in adjacent fields.
Writing Impact factor Citations received in a given year to items a journal published in the two preceding years, divided by the number of citable items it published in those two years. Bibliometric metric proposed by Eugene Garfield in 1955 and published annually in the Journal Citation Reports (JCR, Clarivate). As widely contested as it is used (DORA 2012, CoARA 2022).
Writing Lattes Platform (CNPq) Integrated CNPq system maintaining curricula of Brazilian researchers, research groups (Directory), and institutions. National standard for academic evaluation, scholarship distribution, and funding. Operating since 1999.
Cross-cutting Linear regression Statistical model estimating the linear relationship between a dependent variable and one or more independent variables. Methodological foundation of much of applied statistics and pedagogical entry point for more complex predictive models.
Statistics LLM (Large Language Model) Language model with billions to trillions of parameters, trained on massive text corpora via the Transformer architecture. Immediate ancestors: BERT (2018) and GPT-2 (2019). Milestones: GPT-3 (2020), instruction-tuned models (2022), multimodal models (2023+).
AI/ML Logistic regression Statistical model for categorical dependent variable that estimates the probability of belonging to a category as a logistic function of predictors. Variants: binary, multinomial, and ordinal. Cox (1958) formalized it for binary response.
Statistics MANOVA Multivariate analysis of variance: extension of ANOVA to multiple dependent variables simultaneously. Tests whether group means differ considering correlation structure across outcomes. Test statistics: Wilks' Lambda, Pillai, Hotelling-Lawley, Roy.
Statistics Mediation and moderation Mediation: variable M explains HOW X affects Y (causal mechanism). Moderation: variable W modifies WHEN or FOR WHOM the effect of X on Y occurs (interaction). Distinction formalized by Baron and Kenny (1986); modern approach via Hayes (2018).
Statistics Missing data and multiple imputation Treatment of missing values in research data. Mechanisms: MCAR, MAR, MNAR. Multiple imputation (Rubin, 1987) generates m complete datasets via posterior sampling, combining estimates via Rubin's rules for valid inference.
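The pooling step of multiple imputation can be sketched on its own. The code below assumes the m analyses have already been run and applies Rubin's rules to their point estimates and squared standard errors; the function name and numbers are hypothetical.

```python
import statistics


def pool_rubin(estimates, variances):
    """Combine m per-imputation estimates and their variances (Rubin's rules)."""
    m = len(estimates)
    pooled_estimate = statistics.fmean(estimates)
    within = statistics.fmean(variances)      # average sampling variance
    between = statistics.variance(estimates)  # variance across imputations
    total_variance = within + (1 + 1 / m) * between
    return pooled_estimate, total_variance


# Hypothetical results from m = 3 imputed datasets.
pooled, total_var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what single imputation ignores: it carries the extra uncertainty caused by not knowing the missing values.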
Statistics Mixed-effects models (GLMM) Generalized models combining fixed effects (population parameters) and random effects (variation across groups/subjects). Appropriate for nested, longitudinal, or grouped data. Canonical R implementation via lme4 (Bates et al., 2015).
Statistics Natural language processing (NLP) Field of artificial intelligence and computational linguistics dedicated to representing, processing, and generating human language with computational systems. Spans from classical syntactic analysis to large-scale language models like BERT and GPT.
AI/ML Network analysis Family of methods to study relations among entities represented as nodes and edges. Central metrics: centrality (degree, betweenness, eigenvector), density, modularity, community detection. Wasserman and Faust (1994) is the classical reference.
Statistics Open Access Academic publishing model in which content is free and openly accessible to readers, with no subscription barrier. Exists in four main variants — gold, green, diamond, and hybrid — with different funding and licensing models.
Writing ORCID Unique persistent identifier for researchers, written as 16 characters in four hyphen-separated blocks; the last character is a checksum that may be a digit or X. Maintained by ORCID Inc., a nonprofit organization. Today required by many funders and journals as a condition for submission and grant award.
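An ORCID iD can be sanity-checked offline because its final character is a checksum (ISO 7064 MOD 11-2, as described in ORCID's own documentation; this detail is an addition, not part of the entry above). The sketch below uses hypothetical function names and the iD of ORCID's fictitious sample researcher. It validates format only, not whether the iD is registered.

```python
def orcid_check_digit(base_digits):
    """Checksum character for the first 15 digits (ISO 7064 MOD 11-2)."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """True if the string has 16 characters (hyphens ignored) and a matching checksum."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15].upper()


sample_ok = is_valid_orcid("0000-0002-1825-0097")
sample_bad = is_valid_orcid("0000-0002-1825-0098")  # last digit altered
```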
Cross-cutting Overfitting Phenomenon in which a machine learning model fits the training-set sampling noise excessively, losing generalization ability. Detected by the gap between training error (low) and test error (high). Underfitting is the opposite problem.
AI/ML P-value Probability of obtaining, under the null hypothesis, a test statistic at least as extreme as the observed value. Central metric in frequentist hypothesis testing. The ASA issued a formal statement in 2016 warning against common misinterpretations.
Statistics Peer review Central mechanism of scientific validation in which external reviewers evaluate a manuscript before publication. Modalities: single-blind, double-blind, open peer review, post-publication peer review. Structure inherited from the 18th century, formalized in the 20th.
Writing Plan S International initiative launched in 2018 by cOAlition S — a coalition of European and global research funders — requiring immediate, embargo-free open access for publications resulting from signatory funding. Full implementation since 2021.
Writing Predatory publishing Practice in which journals charge APCs without offering rigorous peer review or legitimate editorial practices, exploiting authors and polluting the scientific literature. Term coined by Jeffrey Beall in 2010. Consensus definition in Grudniewicz et al. (2019, Nature).
Writing Preprint Version of an academic manuscript deposited in an open repository before or alongside submission to a journal. arXiv (1991) started the practice in physics; bioRxiv (2013), SciELO Preprints, and SSRN extended it to other fields. Typically receives a DOI and is citable.
Writing Preregistration Formal deposit of hypotheses, methods, and analysis plan before data collection or analysis, in a repository with verifiable timestamp (OSF, AsPredicted). Distinguishes confirmatory from exploratory analyses. Nosek et al. (2018) synthesized the case for it in "The preregistration revolution".
Cross-cutting PRISMA Preferred Reporting Items for Systematic reviews and Meta-Analyses: international guideline for reporting systematic reviews. Current version: PRISMA 2020 (Page et al., BMJ). 27-item checklist + flow diagram. Near-universal adoption in health.
Cross-cutting Propensity score matching Causal inference method for observational studies that matches treated and control units on the propensity score, the estimated probability of receiving treatment given covariates. Formalized by Rosenbaum and Rubin (1983). Reduces bias from observed confounders; it does not address unmeasured ones.
Statistics PROSPERO International Prospective Register of Systematic Reviews, maintained by CRD (Centre for Reviews and Dissemination, University of York) since 2011. Registers systematic review protocols in health before initiation, with a permanent timestamp and registration number. International standard.
Cross-cutting RAG (Retrieval-Augmented Generation) Architecture combining retrieval over an external document base with a language generation model. The current standard for answering questions with documentary grounding and reducing hallucination in LLMs.
AI/ML Reproducibility and replicability Reproducibility: obtaining the same results with the same data and code. Replicability: obtaining consistent results in an independent study with new data collection. Distinction formalized by Goodman et al. (2016) and adopted by the National Academies (2019).
Cross-cutting Research ethics committee Independent institutional body that ethically evaluates research projects with human participants. CEP/CONEP in Brazil, IRB in the US, REC in the UK. Foundations: Helsinki (1964), Belmont Report (1979), Beauchamp and Childress's principles.
Cross-cutting Response to reviewers Technical document accompanying a revised manuscript, responding point by point to reviewer comments with text modifications and justifications. Decisive for the revision outcome: accept, re-revise, reject.
Writing Retraction Formal withdrawal of an article from the scientific record due to fundamental error, misconduct, or irreproducibility. Not erasure: the article remains available with a visible retraction notice and an active DOI. COPE defines the workflow. Retraction Watch has tracked retractions since 2010.
Writing Scientometric analysis Quantitative study of science as a system: production, collaboration, citations, impact, field dynamics. Differs from bibliometrics by broader scope (policies, national indicators). Methods: network analysis, text mining, temporal analysis.
Cross-cutting Scoping review Structured synthesis that maps literature on a broad topic, identifies key concepts, gaps, and types of evidence. Distinguished from systematic review by broader scope and absence of quality appraisal. Framework by Arksey and O'Malley (2005); reporting via PRISMA-ScR.
Cross-cutting Semantic and instance segmentation Computer vision tasks that classify each pixel of an image. Semantic segmentation assigns a class label per pixel (without distinguishing instances); instance segmentation distinguishes individual objects of the same class. Mean intersection over union (mIoU) is the standard metric for semantic segmentation; instance segmentation is usually scored with mask average precision (AP).
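A minimal sketch of mIoU for semantic segmentation, treating masks as flat lists of per-pixel class ids. The function name and masks are hypothetical; skipping classes absent from both masks is one common convention, and benchmarks differ on this point.

```python
def mean_iou(predicted, target, num_classes):
    """Mean over classes of intersection / union of predicted and target pixels."""
    ious = []
    for cls in range(num_classes):
        intersection = sum(1 for p, t in zip(predicted, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(predicted, target) if p == cls or t == cls)
        if union:  # ignore classes that appear in neither mask
            ious.append(intersection / union)
    return sum(ious) / len(ious)


# Hypothetical 4-pixel masks with two classes.
pred_mask = [0, 0, 1, 1]
true_mask = [0, 1, 1, 1]
miou = mean_iou(pred_mask, true_mask, num_classes=2)  # (1/2 + 2/3) / 2
```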
AI/ML Sensitive data in research Data categories requiring extra protection: health, genetic data, sexual orientation, religion, financial status, geolocation. Regulated by LGPD (Brazil), GDPR (EU), HIPAA (US). Anonymization is not a final solution — re-identification is a growing risk.
Cross-cutting Sentiment analysis NLP subfield that classifies affective polarity (positive, negative, neutral) or identifies specific emotions in text. Approaches evolved from manual lexicons to supervised classifiers to transformer-based models. Pang and Lee (2008) consolidated the field.
AI/ML SHAP values SHapley Additive exPlanations: ML model interpretability framework that attributes each feature's contribution to an individual prediction via Shapley values from cooperative game theory. Lundberg and Lee (2017) unified prior methods.
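The game-theoretic quantity behind SHAP can be computed exactly for a tiny example by averaging each feature's marginal contribution over all orderings. This is a sketch of Shapley values themselves, not of the `shap` library or its approximations; the feature names and the value function are hypothetical.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values; value maps a frozenset of players to a payoff."""
    contributions = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for player in order:
            with_player = coalition | {player}
            contributions[player] += value(with_player) - value(coalition)
            coalition = with_player
    return {p: total / len(orderings) for p, total in contributions.items()}


# Hypothetical model output for each subset of known features.
payoffs = {
    frozenset(): 0.0,
    frozenset({"age"}): 10.0,
    frozenset({"income"}): 4.0,
    frozenset({"age", "income"}): 20.0,
}
phi = shapley_values(["age", "income"], payoffs.__getitem__)
```

The attributions sum to the difference between the full prediction and the baseline (here 20 minus 0), which is the additivity property the acronym refers to. Exact enumeration grows factorially with the number of features, hence the need for approximations in practice.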
AI/ML SJR (SCImago Journal Rank) Journal prestige indicator proposed by González-Pereira et al. in 2010. Applies a PageRank-derived algorithm to Scopus citations, weighting each citation by the prestige of the citing journal. Open, free, structural alternative to JIF.
Writing Statistical power Probability that a statistical test correctly rejects the null hypothesis when it is false, i.e., $1 - \beta$. Recommended minimum standard: 0.80. Cohen (1988) formalized sample size calculation based on power. Preregistration templates today typically ask for an a priori power analysis.
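An a priori sample size can be approximated from the target power, the significance level, and the expected standardized effect. The sketch below assumes a two-sided comparison of two independent group means and uses the normal approximation $n \approx 2\,(z_{1-\alpha/2} + z_{1-\beta})^2 / d^2$ per group; exact t-based calculations give slightly larger values. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)


n_medium = n_per_group(0.5)  # medium effect in Cohen's terms
n_small = n_per_group(0.2)   # small effect needs far more participants
```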
Statistics Structural equation modeling (SEM) Family of multivariate techniques combining factor analysis and multiple regression to test networks of relationships between latent and observed variables. Standard in social, behavioral, and health sciences for validating complex theoretical models.
Statistics Sucupira Platform (CAPES) CAPES system for collecting data from Brazilian graduate programs (master's, doctoral). Foundation of the quadrennial evaluation: grades 3-7 that determine course recognition and scholarship distribution. Operating since 2014, replacing CAPES Coleta.
Cross-cutting Survival analysis Family of methods for time-to-event (death, recurrence, failure) with explicit handling of censored data. Kaplan-Meier estimator (1958) for the survival function; Cox model (1972) for hazard ratio regression.
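A minimal sketch of the Kaplan-Meier product-limit estimator, showing how censored observations leave the risk set without counting as events. The function name and follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each event time; events: 1 = event, 0 = censored."""
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        event_count = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)  # events plus censored at t
        if event_count:
            survival *= 1 - event_count / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical follow-up times; the subjects at t=2 (second one) and t=4 are censored.
km_curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

The censored subject at t=2 still counts in the denominator at that time, then drops out of the risk set, so the step at t=3 is computed over 2 subjects rather than 3.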
Statistics Systematic review Structured synthesis of literature on a specific research question, with explicit, reproducible, and preregistered method. Identifies, appraises, and integrates relevant studies minimizing bias. PRISMA 2020 is the standard reporting guideline.
Cross-cutting Time series Family of statistical methods for time-ordered data, modeling trend, seasonality, autocorrelation, and noise. Classical additive decomposition: $X_t = T_t + S_t + R_t$ (trend, seasonal, and remainder components); canonical parametric models: ARIMA (Box and Jenkins, 1976). Forecasting is the central objective.
Statistics Topic modeling (LDA) Latent Dirichlet Allocation: probabilistic generative model that discovers latent topics in a document corpus. Each document is a mixture of topics; each topic is a distribution over words. Introduced by Blei, Ng, and Jordan (2003), it became the canonical topic model of classical NLP.
AI/ML Train/validation/test split Partitioning of a dataset into three disjoint subsets for machine learning: training (parameter fitting), validation (hyperparameter selection), and test (unbiased final evaluation). Methodological standard to avoid contamination.
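A minimal sketch of a three-way split by random shuffling. The function name, fractions, and seed are illustrative, and the approach assumes independent observations; grouped or time-ordered data need splits that respect that structure to avoid leakage.

```python
import random


def three_way_split(items, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle once, then cut into disjoint train / validation / test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train_ids, val_ids, test_ids = three_way_split(range(100))
```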
AI/ML Transfer learning ML paradigm in which knowledge learned on a source task is transferred to a related target task, reducing labeled data and training time required. Pan and Yang (2010) consolidated the taxonomy. Foundation of pretrained-model use in modern deep learning.
AI/ML Transformer architecture Neural network architecture based exclusively on attention mechanisms, proposed by Vaswani et al. in 2017. Replaced recurrent networks in nearly every NLP task and became the structural foundation of BERT, GPT, Claude, Gemini, and the current generation of language models.
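The core operation of the architecture is scaled dot-product attention, $\mathrm{softmax}(QK^\top / \sqrt{d_k})\,V$. The sketch below implements that single operation with plain lists for readability; it omits multiple heads, learned projections, masking, and everything else in a full Transformer layer, and the function names and toy vectors are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Each output row is a weighted average of the value rows."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


# Identical keys give equal weights, so the output is the mean of the values.
attended = attention([[1.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]], [[0.0, 2.0], [4.0, 6.0]])
```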
AI/ML