AI AND MACHINE LEARNING · 35 entries

AI and Machine Learning.

Entries on computational methods applied to research: machine learning, natural language processing, computer vision, language models, and contemporary neural architectures.

Algorithmic fairness ML subfield studying bias and discrimination in algorithmic systems, with formal criteria (demographic parity, equal opportunity, calibration) that are often in mathematical tension with one another. Barocas, Hardt, and Narayanan (2019) wrote the standard reference.
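
As a rough illustration of how two of these criteria can disagree, the sketch below computes a demographic parity gap and an equal opportunity gap between two groups. The function names and the toy labels are hypothetical, chosen only for this example.

```python
def rate(values):
    """Fraction of 1s in a list of 0/1 values (0.0 for an empty list)."""
    return sum(values) / len(values) if values else 0.0

def demographic_parity_gap(y_pred, group):
    """Positive-prediction rate of group 1 minus that of group 0."""
    g1 = [p for p, g in zip(y_pred, group) if g == 1]
    g0 = [p for p, g in zip(y_pred, group) if g == 0]
    return rate(g1) - rate(g0)

def equal_opportunity_gap(y_true, y_pred, group):
    """True-positive rate of group 1 minus that of group 0."""
    g1 = [p for t, p, g in zip(y_true, y_pred, group) if g == 1 and t == 1]
    g0 = [p for t, p, g in zip(y_true, y_pred, group) if g == 0 and t == 1]
    return rate(g1) - rate(g0)

# Hypothetical toy data: 4 individuals per group.
y_true = [1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
group  = [1, 1, 1, 1, 0, 0, 0, 0]

dp_gap = demographic_parity_gap(y_pred, group)
eo_gap = equal_opportunity_gap(y_true, y_pred, group)
```

On this toy data both groups receive positive predictions at the same rate, yet true positives in group 0 are recovered half as often as in group 1, so one criterion holds while the other does not.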

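AUC-ROC Area Under the Receiver Operating Characteristic curve: discrimination metric for binary classifiers that integrates performance across all decision thresholds. Hanley and McNeil (1982) formalized the probabilistic interpretation: the AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative. Ranges from 0 to 1, with 0.5 corresponding to random ranking and 1.0 to perfect separation; values below 0.5 indicate a ranking worse than chance.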
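
The probabilistic interpretation can be sketched directly: count, over all positive-negative pairs, how often the positive receives the higher score. This is a minimal quadratic-time illustration, not an optimized implementation; `auc_pairwise` and the sample scores are hypothetical.

```python
def auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_pairwise(y_true, scores)
# Negating the scores reverses the ranking and mirrors the AUC around 0.5.
auc_reversed = auc_pairwise(y_true, [-s for s in scores])
```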

BERT Pre-trained language model based on the Transformer architecture, developed by Google in 2018. Trained by *masked language modeling*, BERT established the pre-training + fine-tuning paradigm that dominated natural language processing until the generative LLM era.

BERTopic Modern topic modeling algorithm combining contextual embeddings (BERT, Sentence-Transformers), dimensionality reduction (UMAP), clustering (HDBSCAN), and c-TF-IDF. Introduced by Grootendorst (2022). Often surpasses LDA in semantic coherence on small and medium corpora.

Class imbalance Situation in which the categories of a classification problem are not equally represented, with a majority class dominating the minority class, which is usually the one of interest. Handled by resampling (SMOTE), cost-sensitive learning, and the choice of an adequate metric.
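
One common cost-sensitive option is to weight each class inversely to its frequency. The sketch below uses the heuristic n_samples / (n_classes * class_count); the function name and the 90/10 label split are assumptions made for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}

# Hypothetical imbalanced labels: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
```

With these weights an error on the minority class costs nine times as much as one on the majority class, which offsets the 9:1 imbalance in the loss.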

Classification metrics Family of metrics for evaluating supervised classification models: accuracy, precision, recall, F1-score, AUC-ROC. Each captures a different aspect of the trade-off between false positives and false negatives. Powers (2011) synthesized the canonical framework.
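
A minimal sketch of how the threshold-dependent metrics derive from the four confusion-matrix counts, assuming binary 0/1 labels. The helper name and the toy predictions are hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical predictions: 3 positives, 7 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
m = binary_metrics(y_true, y_pred)
```

Here accuracy is 0.7 while recall is only one third, which shows why a single metric can hide the cost of false negatives.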

CLIP (Contrastive Language-Image Pre-training) Multimodal model pretrained by OpenAI that learns aligned image-text representations via contrastive learning on 400M web image-caption pairs. Introduced by Radford et al. (2021). Enables zero-shot classification and image search, and serves as a foundation for visual generative models.

Cross-validation Predictive model evaluation technique that partitions the dataset into k subsets, trains k times alternating which subset serves as validation, and reports the mean error. Standard for small datasets where a fixed train/test split is unstable.
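
The partition-train-average loop can be sketched as follows. This is a minimal version with contiguous folds and no shuffling (an assumption; in practice the data is usually shuffled first). `fit` and `error` are hypothetical callables supplied by the caller.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) for k contiguous folds over range(n)."""
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    # The first n % k folds get one extra item so sizes differ by at most 1.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

def cross_val_mean_error(xs, ys, k, fit, error):
    """Train k times, each time validating on a different fold."""
    errors = []
    for train, val in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.append(error(model, [xs[i] for i in val], [ys[i] for i in val]))
    return sum(errors) / len(errors)

# Toy model: predict the training mean; error: mean squared error.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def mse(model, xs, ys):
    return sum((model - y) ** 2 for y in ys) / len(ys)

folds = list(k_fold_indices(10, 3))
cv_error = cross_val_mean_error(list(range(6)), [1.0] * 6, 3, fit_mean, mse)
```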

Diffusion models Family of deep generative models that synthesize data by inverting a noising process: they add Gaussian noise to data over many steps and train a network to undo it, generating samples from noise. The basis of modern image generation.

Embeddings Dense vector representations of items (words, sentences, documents, images) in a continuous fixed-dimensional space. Popularized in NLP by word2vec (Mikolov et al., 2013); today the foundation of semantic search, RAG, and most practical AI applications with text.
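
Semantic search over embeddings usually reduces to a nearest-neighbor lookup under cosine similarity. The sketch below shows that lookup; the three-dimensional vectors are invented for illustration and do not come from any real model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)

def nearest(query, vectors):
    """Key of the most similar vector, excluding the query itself."""
    return max((k for k in vectors if k != query),
               key=lambda k: cosine_similarity(vectors[query], vectors[k]))

# Hypothetical toy embeddings.
vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}
closest_to_cat = nearest("cat", vectors)
```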

Feature engineering Set of practices for transforming raw data into informative features for ML models: encoding, normalization, derived feature creation, selection, reduction. Domingos (2012) argued that it is the most important factor in practical ML performance.

Fine-tuning Adaptation of a pre-trained model to a specific task or domain via additional training over smaller labeled data. The dominant paradigm in NLP between 2018 and 2022, still relevant for BERT and specialized variants in technical domains.

Fine-tuning vs prompt engineering Applied comparison between two paradigms for adapting LLMs: fine-tuning (weight update with specific data) and prompt engineering (instruction design without modifying the model). Trade-off among cost, control, latency, and generalization.

Generative adversarial networks (GANs) Generative models in which two networks compete: a generator produces samples from noise and a discriminator tries to separate real from generated. Training seeks a minimax equilibrium. They generate in a single step but suffer instability and mode collapse.

Gradient boosting Ensemble technique that sums many shallow trees trained in sequence, each fit to correct the previous ensemble's errors by approximating the negative gradient of the loss. The de facto standard on tabular data via XGBoost and LightGBM.
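
For squared-error loss the negative gradient is simply the residual, so each round fits a small tree to the current residuals. The sketch below uses depth-1 trees (stumps) on a single feature. It is a simplified illustration under those assumptions, not how XGBoost or LightGBM are implemented, and all names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    # The largest x is excluded so that both sides of the split are non-empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lv) ** 2 for r in left)
               + sum((r - rv) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1:]

def fit_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then add shrunken stumps fit to the residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # negative gradient
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        preds = [p + learning_rate * (lv if x <= t else rv)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model
    return base + sum(learning_rate * (lv if x <= t else rv)
                      for t, lv, rv in stumps)

def mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [float(x) for x in range(10)]
ys = [x * x for x in xs]
baseline = fit_boosting(xs, ys, n_rounds=0)   # constant mean prediction
boosted = fit_boosting(xs, ys, n_rounds=50)
```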

Hallucination Generation, by a language model, of fluent and plausible content that is factually incorrect or unsupported by the source. It is usually classified along two axes: intrinsic vs extrinsic, and factuality vs faithfulness to the material provided.

Human annotation and inter-annotator agreement Manual labeling of data (text, image, audio) by human annotators, the basis of supervised datasets in ML. Inter-annotator agreement (IAA) is measured with chance-corrected statistics such as Cohen's kappa (1960) and Krippendorff's alpha, and is an essential quality criterion for annotated data.
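
Cohen's kappa corrects observed agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). A minimal sketch for two annotators follows; the label sequences are invented for illustration.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators' label lists."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    p_o = sum(1 for x, y in zip(a, b) if x == y) / n   # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    # Expected agreement if both annotators labeled independently
    # according to their own marginal label frequencies.
    p_e = sum(count_a[l] * count_b[l] for l in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations of 10 items.
ann_a = ["pos"] * 6 + ["neg"] * 4
ann_b = ["pos"] * 5 + ["neg", "pos"] + ["neg"] * 3
kappa = cohens_kappa(ann_a, ann_b)
```

The annotators agree on 8 of 10 items, but because chance agreement is already 0.52, kappa comes out near 0.58 rather than 0.8.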

LIME Local, model-agnostic explainability method: to explain an individual prediction of a black box, it fits a simple interpretable model in the neighborhood of that case, from perturbations weighted by proximity.

LLM (Large Language Model) Language model with billions to trillions of parameters, built on the Transformer architecture and trained on massive text corpora. Immediate ancestors: BERT (2018) and GPT-2 (2019). Milestones: GPT-3 (2020), instruction-tuned models (2022), multimodal models (2023+).

Natural language processing (NLP) Field of artificial intelligence and computational linguistics dedicated to representing, processing, and generating human language with computational systems. Spans from classical syntactic analysis to large-scale language models like BERT and GPT.

Overfitting Phenomenon in which a machine learning model fits the training-set sampling noise excessively, losing generalization ability. Detected by the gap between training error (low) and test error (high). Underfitting is the opposite problem.

RAG (Retrieval-Augmented Generation) Architecture combining retrieval over an external document base with a language generation model. The current standard for answering questions with documentary grounding and reducing hallucination in LLMs.

Random forest Ensemble method that combines many independent decision trees, built with bagging and random feature selection, and aggregates their predictions by vote or average. The randomness decorrelates the trees and reduces variance.

Reinforcement learning Third major paradigm of machine learning, alongside supervised and unsupervised learning: an agent learns by interacting with an environment, choosing actions and maximizing cumulative reward over time. Formalized by the Markov decision process; combined with deep networks, it becomes deep RL.
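
Cumulative reward is commonly formalized as a discounted return, G_t = r_t + gamma * G_{t+1}. The discount factor gamma is an assumption of this sketch (the entry does not specify one), and the function name is hypothetical.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t for every step t, computed backward from the episode end."""
    g = 0.0
    returns = []
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

# A three-step episode with reward 1 at each step and gamma = 0.5.
returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
```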

Semantic and instance segmentation Computer vision tasks that classify each pixel of an image. Semantic segmentation assigns a class label per pixel (without distinguishing instances); instance segmentation distinguishes individual objects of the same class. mIoU is the standard metric for semantic segmentation; instance segmentation is usually evaluated with mask AP.
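
mIoU averages, over classes, the ratio of correctly labeled pixels to all pixels carrying that class in either the prediction or the ground truth. The sketch below assumes flattened label maps and skips classes absent from both (one common convention, assumed here); the pixel labels are invented.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for flat label maps."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in prediction or target")
    return sum(ious) / len(ious)

# Hypothetical 6-pixel image with 3 classes.
target = [0, 0, 1, 1, 2, 2]
pred   = [0, 1, 1, 1, 2, 0]
miou = mean_iou(pred, target, num_classes=3)
```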

Sentiment analysis NLP subfield that classifies affective polarity (positive, negative, neutral) or identifies specific emotions in text. Approaches evolved from manual lexicons to supervised classifiers to transformer-based models. Pang and Lee (2008) consolidated the field.

SHAP values SHapley Additive exPlanations: ML model interpretability framework that attributes each feature's contribution to an individual prediction via Shapley values from cooperative game theory. Lundberg and Lee (2017) unified prior methods.
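
For a handful of features, Shapley values can be computed exactly by averaging each feature's marginal contribution over all orderings. The sketch below does that by brute force. It is not the SHAP library's algorithm (which relies on approximations and model-specific shortcuts), and the toy model, instance, and zero baseline are assumptions for illustration.

```python
from itertools import permutations

def shapley_values(value_fn, n_features):
    """Exact Shapley values; value_fn maps a frozenset of indices to a payoff."""
    phi = [0.0] * n_features
    orders = list(permutations(range(n_features)))
    for order in orders:
        coalition = set()
        prev = value_fn(frozenset(coalition))
        for i in order:
            coalition.add(i)
            cur = value_fn(frozenset(coalition))
            phi[i] += cur - prev  # marginal contribution of feature i
            prev = cur
    return [p / len(orders) for p in phi]

# Hypothetical model with an interaction between features 0 and 2.
def model(x):
    return 2 * x[0] + 3 * x[1] + x[0] * x[2]

instance = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]

def value_fn(coalition):
    """Model output when features outside the coalition take baseline values."""
    mixed = [instance[i] if i in coalition else baseline[i]
             for i in range(len(instance))]
    return model(mixed)

phi = shapley_values(value_fn, 3)
```

The attributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is split evenly between features 0 and 2.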

Synthetic data Artificially generated data that reproduces the statistical properties of a real dataset without exposing the original records. Quality is assessed along three dimensions in tension: fidelity, utility, and privacy.

Tokenization Process that converts raw text into the sequence of discrete units (tokens) a language model processes. Current models use subwords (BPE, SentencePiece), a middle ground between whole word and character that sets cost and vocabulary coverage.
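
The core of BPE is a loop that repeatedly merges the most frequent adjacent symbol pair. The sketch below is a simplified character-level version: no end-of-word marker, ties broken deterministically, merges applied in learned order. These are assumptions of the example, as are the word frequencies, and production tokenizers differ in those details.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Learn an ordered list of merges from a list of words."""
    vocab = Counter(tuple(w) for w in words)  # symbol sequence -> frequency
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties go to the lexicographically larger pair.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in vocab.items():
            merged[merge_pair(symbols, best)] += freq
        vocab = merged
    return merges

def tokenize(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Hypothetical corpus.
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, num_merges=3)
tokens = tokenize("lowest", merges)
```

The word "lowest" never appears in the corpus, yet it is covered by subwords learned from other words, which is the vocabulary-coverage benefit the entry describes.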

Topic modeling (LDA) Latent Dirichlet Allocation: probabilistic generative model that discovers latent topics in a document corpus. Each document is a mixture of topics; each topic is a distribution over words. Blei, Ng, and Jordan (2003) introduced the model, which became the canonical framework of classical topic modeling.

Train/validation/test split Partitioning of a dataset into three disjoint subsets for machine learning: training (parameter fitting), validation (hyperparameter selection), and test (unbiased final evaluation). Methodological standard to avoid contamination.
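
A minimal sketch of a three-way split: shuffle once with a fixed seed, then cut the shuffled order into disjoint slices. The 70/15/15 proportions, the seed, and the function name are assumptions for illustration.

```python
import random

def train_val_test_split(items, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle indices once, then slice into disjoint train/val/test sets."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)  # local RNG keeps the split reproducible
    n_test = round(len(items) * test_frac)
    n_val = round(len(items) * val_frac)
    test = [items[i] for i in idx[:n_test]]
    val = [items[i] for i in idx[n_test:n_test + n_val]]
    train = [items[i] for i in idx[n_test + n_val:]]
    return train, val, test

data = list(range(100))
train, val, test = train_val_test_split(data)
```

The test slice should be touched only once, after hyperparameters have been chosen on the validation slice; reusing it for model selection reintroduces the contamination the split is meant to prevent.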

Transfer learning ML paradigm in which knowledge learned on a source task is transferred to a related target task, reducing labeled data and training time required. Pan and Yang (2010) consolidated the taxonomy. Foundation of pretrained-model use in modern deep learning.

Transformer architecture Neural network architecture built on attention mechanisms, with no recurrence or convolution, proposed by Vaswani et al. in 2017. Replaced recurrent networks in nearly every NLP task and became the structural foundation of BERT, GPT, Claude, Gemini, and the current generation of language models.
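
The central operation is scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V. The sketch below implements a single head on plain Python lists, without masking, multiple heads, or learned projections; the matrices are toy values chosen for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention; Q, K, V are lists of row vectors."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs

K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
uniform = attention([[0.0, 0.0]], K, V)   # query equally similar to both keys
focused = attention([[10.0, 0.0]], K, V)  # query strongly matching the first key
```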

YOLO (You Only Look Once) Family of real-time object detection models that reformulated detection as direct regression of bounding boxes and class probabilities in a single network pass. Introduced by Redmon et al. (2016). IoU measures the overlap between a predicted and a ground-truth box; mAP summarizes detection performance across classes.
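
IoU between two axis-aligned boxes is the area of their intersection divided by the area of their union. A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates; the function name and sample boxes are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

overlap = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```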

Zero-shot and few-shot learning Regimes in which a model solves a task with no labeled examples of the target class (zero-shot) or very few (few-shot). In language models, they take the form of in-context learning, with the task specified in the prompt itself.