2  Machine Learning

The Rise of Learning Machines

In the previous chapter, “Foundations of AI: Logic, Knowledge, and the Dawn of Intelligence,” we explored the early ambitions of Artificial Intelligence, particularly through the lens of “Good Old-Fashioned AI” (GOFAI). We saw how GOFAI excelled at problems that could be solved with explicit rules, logical reasoning, and structured knowledge. However, we also touched upon its fundamental limitations: its brittleness when faced with real-world complexity, its struggle with common-sense knowledge, and the sheer impossibility of hand-coding every piece of information or every rule for truly unstructured problems.

This is precisely where Machine Learning (ML) emerges as a transformative paradigm. ML is a field within AI that allows computers to “learn” from data, finding patterns, making predictions, and even making decisions without being explicitly programmed for every single scenario. It’s the answer to problems that are too complex or too nuanced for us to write down precise, step-by-step instructions.

It’s the technology behind your phone’s face recognition, the recommendations you get on streaming services, and the filtering that keeps spam out of your inbox. Instead of writing precise step-by-step instructions for such tasks by hand, we design systems that can discover those instructions themselves by observing examples and experiences. From a techno-pragmatist perspective, Machine Learning represents a human-designed approach to extend AI’s capabilities into previously “unsolvable” domains, highlighting our ongoing agency in shaping intelligent tools.

What is Machine Learning? A Formal Definition

To better understand this transformative paradigm, it’s helpful to consider a formal definition that captures its essence. A widely cited formulation, adapted here from AI researcher Tom Mitchell, describes Machine Learning as a computational approach to problem-solving with four key ingredients: a task to solve (T), a performance metric (M), a computer program (P), and a source of experience (E). In this framing, you have a Machine Learning solution when the performance of program P at task T, as measured by M, improves with access to the experience E.

This definition highlights the core essence of ML: improvement through experience. In traditional software, a program’s performance is fixed by its initial design; it doesn’t get “smarter” the more you use it. In stark contrast, an ML program is specifically designed to automatically get better at its task as it processes more data or gains more experience. This fundamental shift allows us to tackle problems that were once beyond the reach of explicit programming.

How Machines Learn

As we discussed in Chapter 1, the history of AI has been marked by a “tug-of-war” between Symbolic and Statistical approaches. In this chapter, we delve deeper into the principles that define the Statistical AI side, specifically focusing on the various ways machines learn from data. Machine learning paradigms differ based on the type of experience available, the quality of feedback they receive, and the strategies they employ to learn.

Every machine learning process, regardless of its specific flavor, relies on a combination of three essential factors. First, there is the Nature of Experience, which refers to the kind of data or environment the machine interacts with. This experience can be Static, like a fixed collection of books, images, or recorded observations (datasets), where the machine learns from what’s already provided without influencing the data itself. Alternatively, it can be Dynamic, involving an environment where the machine’s actions influence the experience it receives, often seen in simulations or real-world interactions.

Second, the Quality of Feedback describes how clear and direct the guidance is that the machine receives about its performance. Explicit Feedback provides direct and precise information about whether an action was correct or incorrect, such as a teacher marking each answer on your homework as right or wrong. Conversely, Implicit Feedback offers indirect cues that suggest whether the machine is doing well or poorly, like kicking a soccer ball and observing its trajectory. Challenges often arise when feedback is Delayed, meaning the outcome is known much later (like losing a chess game), or Sparse, where feedback is available for only a small subset of decisions.

Finally, Learning Strategies are the actual mechanisms by which the machine processes experience and feedback to improve. Imitation, also known as Learning by Demonstration, involves observing correct behavior and trying to reproduce it, making it highly efficient but requiring explicit, high-quality examples. Trial-and-Error involves trying different actions, observing the consequences, and then adjusting behavior based on whether the outcomes are better or worse; this method is slower but effective when explicit instructions are unavailable. Lastly, Pattern Recognition focuses on finding hidden regularities or structures within data, often without direct guidance on what those patterns should be.

Learning by Example: Supervised Learning

Supervised Learning is the most common and intuitive machine learning paradigm. It’s based on imitation learning and relies on static experience with the most explicit kind of feedback. In this setup, the machine is given a dataset of “input/output pairs.” The task is to learn a function or mapping that can produce the correct output for any given input.

This approach is fundamentally a problem of prediction: given a new, unseen input, the system predicts its corresponding output. The core assumption here is that there’s a computable relationship between the input and output that can be inferred from a sufficiently large number of examples. Supervised learning problems typically fall into two main categories: Classification and Regression.

Classification is used when the output is a category chosen from a finite set. This can involve Binary Classification, where the system predicts one of two categories (e.g., “spam” or “not spam,” “cancer” or “no cancer”). It also includes Multi-class Classification, which predicts one of more than two categories (e.g., identifying the type of animal in an image: “dog,” “cat,” “bird”). Furthermore, Multi-label Classification predicts zero, one, or more than one category for a single input, such as identifying all topics present in a news article.

Regression, on the other hand, is applied when the output is a continuous numerical value. A typical example is predicting the price of a house based on its features like the number of rooms, location, or size. It can also be used for forecasting stock prices.
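
To make the two flavors concrete, here is a minimal sketch, assuming the scikit-learn library is available; the synthetic data stands in for real emails or house listings.

    # A minimal supervised-learning sketch using scikit-learn (assumed installed).
    # The data is synthetic and purely illustrative.
    from sklearn.datasets import make_classification, make_regression
    from sklearn.linear_model import LinearRegression, LogisticRegression

    # Classification: inputs X paired with categorical labels y (e.g., spam / not spam).
    X_cls, y_cls = make_classification(n_samples=200, n_features=5, random_state=0)
    classifier = LogisticRegression().fit(X_cls, y_cls)
    print("Predicted class:", classifier.predict(X_cls[:1]))

    # Regression: inputs X paired with continuous targets y (e.g., house prices).
    X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
    regressor = LinearRegression().fit(X_reg, y_reg)
    print("Predicted value:", regressor.predict(X_reg[:1]))

In both cases the pattern is the same: show the model input/output pairs, let it learn a mapping, and then ask it for outputs on new inputs.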

Learning by Discovery: Unsupervised Learning

In contrast to supervised learning, Unsupervised Learning deals with situations where we don’t have access to the “right outputs” or labels. Instead, the machine is given a large static dataset, and its primary goal is to find hidden patterns, structures, or relationships within that data. Here, feedback is highly indirect and often implicitly defined, and the main learning strategy is pattern recognition.

The underlying assumption is that there’s some inherent organization or regularity in the data that can be summarized or explained with a smaller amount of information than simply listing all elements. The human designer still plays a crucial role by providing the algorithm’s inductive bias—the type of regularities or structures it should look for. For example, when grouping customers, the human might decide the algorithm should prioritize grouping by age rather than by location, thereby introducing a specific preference for how patterns are discovered. This type of design bias is distinct from the harmful societal biases that models can unfortunately learn from biased data, which we will discuss later. Two common sub-problems in unsupervised learning are Clustering and Dimensionality Reduction.

Clustering involves grouping similar elements together based on their characteristics. The machine defines these groups without being told what the groups are beforehand. For example, this could mean grouping customers based on their purchasing behavior or segmenting news articles by topic without predefined categories.

Dimensionality Reduction aims to find a more compact or simplified representation of each element while retaining its most essential characteristics. This is often used as a preprocessing step to make data easier for other ML algorithms to handle. An example could be reducing a high-resolution image to a lower-resolution version that still captures its key visual information, or simplifying complex customer data into a few key “dimensions” that describe their preferences.
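
As a small concrete sketch, assuming scikit-learn and NumPy are available, the code below clusters a set of unlabeled points and then compresses them into two summary dimensions; the data here is random and purely illustrative.

    # A minimal unsupervised-learning sketch: no labels, only inputs.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 10))  # 300 unlabeled examples with 10 features each

    # Clustering: group similar examples into 3 clusters chosen by the algorithm itself.
    clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    # Dimensionality reduction: compress each example down to 2 summary dimensions.
    X_compact = PCA(n_components=2).fit_transform(X)
    print(clusters[:10], X_compact.shape)  # cluster ids, and the reduced shape (300, 2)

Note that the human still injects inductive bias through choices like the number of clusters or the number of retained dimensions; the algorithm only decides how to fill in that structure.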

Learning by Doing: Reinforcement Learning

The third fundamental machine learning paradigm is Reinforcement Learning (RL). Unlike supervised or unsupervised learning, RL primarily involves dynamic simulations where an “agent” (the machine learning program) can interact with an environment and learn to make decisions that lead to desirable outcomes. This approach is fundamentally based on learning by trial and error.

Reinforcement Learning is particularly useful when the problem involves a sequence of actions, and there’s no obvious way to define the “correct” sequence beforehand other than through experimentation. The machine learns by receiving a “payoff” or “reward” after performing an action or a sequence of actions, and its objective is to maximize the total accumulated payoff over time. A major challenge in RL is credit assignment: deciding which specific actions in a long sequence contributed to the ultimate reward, especially when feedback is delayed (you only know if your strategy was good at the end of a game) and sparse (rewards are infrequent).

For instance, RL is used for training an AI to play a video game, where it learns to move, jump, and attack by receiving points (rewards) for achieving goals and penalties for mistakes. Other prominent examples include training robots to perform complex tasks or developing self-driving car algorithms that learn to navigate traffic in dynamic environments.
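
The following toy sketch, written in plain Python and NumPy rather than any dedicated RL library, shows the core trial-and-error loop. A tabular Q-learning agent walks a five-cell corridor and is rewarded only when it reaches the far end, so feedback is both delayed and sparse.

    # A toy tabular Q-learning sketch (illustrative, not a production RL setup).
    import numpy as np

    n_states, n_actions = 5, 2           # actions: 0 = step left, 1 = step right
    Q = np.zeros((n_states, n_actions))  # estimated value of each action in each state
    alpha, gamma, epsilon = 0.1, 0.9, 0.2

    rng = np.random.default_rng(0)
    for episode in range(1000):
        state = 0
        while state != n_states - 1:
            # Epsilon-greedy: usually exploit the best-known action, sometimes explore.
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(Q[state].argmax())
            next_state = max(0, state - 1) if action == 0 else state + 1
            reward = 1.0 if next_state == n_states - 1 else 0.0  # sparse, delayed payoff
            # Q-learning update: nudge the estimate toward reward plus discounted future value.
            Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
            state = next_state

    print(Q[:-1].argmax(axis=1))  # should come out preferring action 1 ("right") in every cell

After enough episodes the agent prefers “step right” everywhere, even though no individual step except the last one was ever rewarded directly; propagating the payoff backwards through the table is precisely how this sketch handles credit assignment.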

Blurring the Lines: Hybrid Learning Approaches

While supervised, unsupervised, and reinforcement learning are the three foundational paradigms, many real-world applications and advanced techniques blur these lines. They combine elements from different approaches to achieve more powerful results.

One such blend is Semi-supervised Learning. This approach is used when you have a small amount of explicitly labeled data (input/output pairs) but a vast amount of unlabeled data (just inputs). The machine leverages the unlabeled data to learn general structures or patterns (unsupervised) and then uses the small amount of labeled data to refine its predictions for specific tasks (supervised). This method is particularly useful when obtaining labeled data is expensive.

Another hybrid is Self-supervised Learning. In this paradigm, the machine learns by creating its own “supervision” from the input data itself. The “output” it tries to predict is itself a piece of the input that has been hidden or masked. A common form involves predicting a missing word in a sentence based on the surrounding words. The machine generates its own labels from the vast amount of text data available, making it highly scalable like unsupervised learning, but with the explicit prediction goal of supervised learning.

Finally, Active Learning represents a different kind of hybrid, mixing traits of supervised and reinforcement learning. Here, the machine has access to a vast, potentially infinite amount of unlabeled data and an “oracle” (often a human expert) who can provide the correct output for any specific input. The key is that the machine actively chooses which unlabeled examples to ask the human to label, aiming to select the most informative examples that will maximize its learning while minimizing the human effort required.
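
To give a flavor of active learning, here is a minimal uncertainty-sampling loop, assuming scikit-learn is available. The “oracle” is simulated by a hidden label array standing in for a human expert, and the model asks only for the labels it expects to learn the most from.

    # A minimal active-learning sketch with a simulated oracle (illustrative only).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y_oracle = make_classification(n_samples=500, n_features=10, random_state=0)
    # Seed the labeled pool with one example of each class so the model can be fit.
    labeled = [int(np.flatnonzero(y_oracle == c)[0]) for c in (0, 1)]
    unlabeled = [i for i in range(len(X)) if i not in labeled]

    model = LogisticRegression()
    for _ in range(20):
        model.fit(X[labeled], y_oracle[labeled])
        # Uncertainty sampling: query the example the model is least sure about.
        probs = model.predict_proba(X[unlabeled])
        most_uncertain = int(np.abs(probs[:, 1] - 0.5).argmin())
        chosen = unlabeled.pop(most_uncertain)
        labeled.append(chosen)  # "ask the oracle" for this one label only

    print("Labels requested:", len(labeled))

The design choice here, picking the example closest to the decision boundary, is only one of many possible query strategies, but it captures the key idea: the learner, not the human, decides where labeling effort is spent.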

The Training Process: From Data to Generalization

Once we’ve chosen a learning paradigm and a type of model, the machine needs to be “trained.” This process involves feeding the model data and adjusting its internal workings so it can perform its task effectively. The ultimate goal of training is not just for the model to perform well on the data it has seen, but to generalize – to work effectively on new, unseen data.

Preparing the Data

To properly evaluate a machine learning model’s performance and ensure it can generalize, we typically split our available data into three distinct sets. The Training Set is the largest portion of the data, used to “teach” the model, where it learns patterns and relationships. The Validation Set is used during the training process to tune the model’s settings, known as “hyperparameters,” and to make decisions about when to stop training. This set acts as a “practice exam” that the model takes periodically to check its progress, but performance on it is never reported as the final measure of success.

The Test Set is the most crucial set for evaluating the model’s true performance. It consists of data that the model has never seen before during any part of its training or tuning. This set is used only once at the very end to give an unbiased estimate of how well the model will perform in the real world on new data. To measure performance, we use evaluation metrics (M, from Mitchell’s definition). For classification tasks, this might be accuracy (the percentage of correct predictions), while for regression, it might be a measure of error (how far off the predictions are from the actual values). Sometimes, the direct “success” metric we care about (like “winning a chess game”) is hard to optimize directly, so we use a proxy metric (like “predicting whether a given board position comes from a game that was ultimately won”) that is easier for the machine to learn from.
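
A minimal sketch of this three-way split, assuming scikit-learn is available, might look like the following; the roughly 60/20/20 proportions are a common but not mandatory choice.

    # Splitting data into training, validation, and test sets (illustrative sketch).
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)
    X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=0)

    model = LogisticRegression().fit(X_train, y_train)  # learn from the training set only
    print("Validation accuracy:", accuracy_score(y_val, model.predict(X_val)))  # used for tuning
    print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))      # reported once, at the end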

The Challenge of Robust Learning

During the training process, two major challenges constantly arise, representing the core struggle of ensuring a model generalizes well. Underfitting occurs when the model is too simple or hasn’t learned enough from the training data to capture the underlying patterns. It performs poorly on both the training data and any new data. Think of a student who hasn’t studied enough for an exam – they perform badly on the practice questions and the actual test. In this scenario, the model is too basic to understand the complexity of the problem.

Conversely, Overfitting is arguably the more insidious problem. It happens when the model learns the training data too well, including its noise, random fluctuations, and specific quirks that aren’t representative of the broader problem. An overfit model will perform exceptionally well on the training data but poorly on new, unseen data.

Imagine a student who memorized only the practice exam questions and their answers, but doesn’t understand the underlying concepts; they ace the practice test but fail the real exam when the questions are slightly different. The model is too complex or too specialized for the training data, failing to capture a generalizable pattern.

The ultimate goal of training is to find a delicate balance: a model that is complex enough to capture the true patterns in the data (avoiding underfitting) but simple enough to ignore the noise and generalize well to new data (avoiding overfitting).
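
The trade-off can be made visible in a few lines of NumPy by fitting polynomials of increasing flexibility to noisy data: a straight line is too simple, while a very flexible curve chases the noise. The exact numbers below are illustrative only.

    # Underfitting vs. overfitting with polynomial fits (illustrative sketch).
    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.linspace(0.0, 1.0, 20)
    x_val = np.linspace(0.025, 0.975, 20)
    y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=x_train.size)
    y_val = np.sin(2 * np.pi * x_val) + rng.normal(scale=0.2, size=x_val.size)

    for degree in (1, 4, 10):  # too simple, about right, too flexible for this much data
        coeffs = np.polyfit(x_train, y_train, degree)
        train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        val_err = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
        print(f"degree {degree:2d}  train error {train_err:.3f}  validation error {val_err:.3f}")

The straight line (degree 1) has high error everywhere, like the under-prepared student; the highly flexible curve tends to drive the training error down while the validation error creeps back up, like the student who memorized the practice exam.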

To help models achieve good generalization and avoid the pitfalls of underfitting and overfitting, machine learning practitioners use several key techniques. Regularization refers to a set of methods that “penalize overly complex formulas” during the training process. By adding a “cost” for complexity, regularization encourages the model to find “smoother formulas that are more likely to generalize to unseen data.” It’s like telling the student to focus on understanding the core principles rather than memorizing every single detail.

Early Stopping is a practical technique where we monitor the model’s performance not just on the training set, but also on the validation set. Training is halted when the model’s performance on the validation set starts to degrade, even if its performance on the training set is still improving. This prevents the model from continuing to learn the “noise” of the training data and thus prevents overfitting.
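
A sketch of early stopping in code, assuming scikit-learn is available, might look like this: after each pass over the training data we check the validation score and stop once it fails to improve for a few consecutive epochs (the “patience”).

    # Early stopping with a simple patience counter (illustrative sketch).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

    model = SGDClassifier(random_state=0)
    classes = np.unique(y_train)
    best_score, patience, bad_epochs = -np.inf, 3, 0
    for epoch in range(100):
        model.partial_fit(X_train, y_train, classes=classes)  # one more pass over the training data
        score = model.score(X_val, y_val)                      # check progress on held-out data
        if score > best_score:
            best_score, bad_epochs = score, 0
        else:
            bad_epochs += 1
        if bad_epochs >= patience:  # validation performance has stalled: stop here
            print(f"Stopping early at epoch {epoch}")
            break
    print(f"Best validation accuracy: {best_score:.3f}")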

Additionally, Cross-Validation is a more robust way to evaluate a model’s performance, especially with smaller datasets. Instead of a single train/validation/test split, the data is divided into multiple “folds,” and the model is trained and validated multiple times, with different folds serving as the validation set each time.

This provides a more reliable estimate of how the model will perform on unseen data. In the end, good generalization is the goal of machine learning: the model should work well on data it has never seen before. These techniques help ensure that the “learning” is truly applicable beyond the training examples.
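
To make the last of these techniques concrete: assuming scikit-learn is available, the sketch below runs 5-fold cross-validation on a regularized (Ridge) regression model, so it illustrates regularization and cross-validation together; each fold takes one turn as the validation set.

    # Cross-validation of a regularized model (illustrative sketch).
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=200, n_features=30, noise=15.0, random_state=0)
    model = Ridge(alpha=1.0)  # alpha sets the "cost" charged for overly complex formulas

    # 5-fold cross-validation: train and validate five times, rotating the validation fold.
    scores = cross_val_score(model, X, y, cv=5)
    print("Per-fold scores:", scores.round(3), "mean:", scores.mean().round(3))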

How Machines Model Reality

When a machine learning algorithm learns, it’s essentially trying to build a “model” of the relationships within the data. This “model” is a simplified representation of reality that allows the machine to make predictions or find patterns. The way these models are structured leads to a fundamental distinction: parametric versus non-parametric.

Before training, we define a “model family” or “hypothesis space.” This is the set of all possible programs or functions that our machine learning algorithm can choose from. Think of it as a blueprint or a template for the kind of solution the AI is allowed to find. This choice of model family is a crucial initial decision, as it defines the fundamental capabilities and limitations of the learning system. For example, if we’re trying to predict house prices, our model family might be “all straight lines” or “all wavy curves.” The algorithm’s job is to find the best specific line or curve within that family.

Parametric vs. Non-parametric Models

Parametric models are those where the model family has a fixed number of parameters (or “weights”) that need to be learned from the data. The model’s structure is predetermined, and learning involves finding the optimal values for these parameters. Their advantages include often having very efficient training algorithms because their structure is well-defined. The size of the model is independent of the size of the training data, making them good at “compressing” large datasets into a fixed-size “formula.” They are also often easy to regularize, helping to prevent overfitting.

However, parametric models imply significant assumptions about the data. If these assumptions are wrong (e.g., assuming a linear relationship when the data is clearly curved), the model is very likely to underfit, failing to capture the true patterns. Examples include Logistic Regression (which finds a fixed set of coefficients), Naive Bayes (which learns a fixed set of probabilities), and Neural Networks (with a fixed number of layers and neurons).

Non-parametric models, in contrast, have a variable number of parameters that often depends on the size or complexity of the training data. Their structure is more flexible and can adapt to the data’s intricacies. These models often imply weaker assumptions about the data, making them more adaptable to complex or unusual datasets where the underlying relationships aren’t clear. They can “adapt to difficult datasets more easily than similarly complex parametric models.”

However, training algorithms for non-parametric models can be more “ad-hoc” and less universally efficient. Their model size can sometimes be proportional to the training set size, making them less suitable for extremely large datasets if memory is a concern. Examples include K-nearest neighbors (which essentially “remembers” training examples), Support Vector Machines (SVMs) with non-linear kernels, and Decision Trees (where the number of “branches” can grow with data complexity).
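
The contrast shows up in just a few lines, assuming scikit-learn is available: on a deliberately curved dataset, a linear parametric model tends to underfit, while a non-parametric nearest-neighbor model, which simply keeps the training points around, adapts to the curve.

    # Parametric vs. non-parametric on a non-linear dataset (illustrative sketch).
    from sklearn.datasets import make_moons
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_moons(n_samples=500, noise=0.25, random_state=0)  # two interleaved "moons"
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    parametric = LogisticRegression().fit(X_train, y_train)  # learns a fixed set of weights
    nonparametric = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)  # stores the data

    print("Logistic regression accuracy:", parametric.score(X_test, y_test))
    print("k-nearest neighbors accuracy:", nonparametric.score(X_test, y_test))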

Generative vs. Discriminative Models

When a machine learning model is tasked with distinguishing between different categories (like “dog” vs. “cat”), it can approach the problem in two fundamentally different ways: by understanding how each category is generated, or by understanding what discriminates between them.

Generative models try to learn what the fundamental characteristics are that define each class independently. Their goal is to understand how an example of a given class is made. These models estimate the probability of observing specific features given a certain class (P(features∣Class)).

As an analogy, if you want to recognize dogs and cats, a generative model would first try to learn: “What is a dog?” (e.g., dogs generally have four legs, large noses, round ears) and “What is a cat?” (e.g., cats generally have four legs, small noses, pointy ears). To classify a new animal, it asks: “How likely is this animal to be a dog, based on what I know about dogs?” and “How likely is this animal to be a cat, based on what I know about cats?” Then it picks the more likely one. Because they learn how data is generated for each class, these models can often be used to generate new, random examples of each class. Naive Bayes is a classic generative model.

Discriminative models, on the other hand, try to learn directly what makes one class different from another. Their focus is on finding the boundaries or distinctions between categories. These models estimate the probability of a class given a set of observed features (P(Class∣features)).

As an analogy, instead of learning what a dog is and what a cat is, a discriminative model would learn: “What features best separate dogs from cats?” It might learn that “the larger the nose or the tongue, the more likely to be a dog,” or “the pointier the ears and the sneakier the eyes, the more likely it is to be a cat.” It might not care about features like “number of legs” if they don’t help in distinguishing the two. These models are excellent at computing “feature importance”—telling you which features are most useful for separating the classes. Logistic Regression and most Neural Networks are discriminative models.
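
As a small illustrative sketch, assuming scikit-learn is available, the code below fits one model of each kind to the same synthetic data.

    # Generative vs. discriminative classifiers (illustrative sketch).
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    generative = GaussianNB().fit(X_train, y_train)              # models P(features | class)
    discriminative = LogisticRegression().fit(X_train, y_train)  # models P(class | features)

    print("Naive Bayes accuracy:", generative.score(X_test, y_test))
    print("Logistic regression accuracy:", discriminative.score(X_test, y_test))

Because the Naive Bayes model stores a per-class estimate of how each feature is distributed, you could in principle draw from those estimated distributions to generate new, synthetic examples of either class; the purely discriminative model has no comparable notion of what a class “looks like” on its own.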

Which is Better? The Trade-offs

There’s no single “best” type of model; the choice between generative and discriminative approaches, or parametric and non-parametric, depends on the specific problem you’re trying to solve, the amount and quality of data you have, and what you intend to do with the solution.

Discriminative models often excel at pure performance in classification tasks. They focus on the important features for the task and can disregard anything that doesn’t help them score better. However, they learn “what’s useful,” which can sometimes be “something completely off-track, like spurious correlations or harmful biases in the training set.”

Generative models “often encode stronger inductive biases” because they represent a human’s hypothesis about how the data is created. This can make them “more robust and controllable,” but if that initial hypothesis is too far from reality, they “may not learn anything useful.” Ultimately, there is “no silver bullet.” You need to ask the right questions, be mindful of your assumptions, and always rely on experimentation to decide what works best in practice.

Understanding these trade-offs is crucial for responsible AI development, as the choices made can lead to models that learn unintended biases from data. The implications of this, such as the “Tank Problem” where a model learns irrelevant correlations, will be explored in more detail later in the book (e.g., in the “Dangers of AI” chapter).

A Glimpse at Common Machine Learning Models

A profound theoretical result in machine learning, known as the No Free Lunch theorem, formalizes the idea that no single algorithm is universally superior. The theorem states that if an algorithm performs better on a specific subset of problems, it must perform correspondingly worse on the remaining problems, so that, averaged over all possible problems, every algorithm performs the same. In essence, there is no universal, “one-size-fits-all” learning algorithm that performs best in every scenario.

This means that to develop an optimal learning algorithm for a specific task, you must always introduce some “bias” or prior knowledge about the data and the problem itself. There is no such thing as an unbiased model, only models with different types of biases. This choice of what assumptions to make about your data is where the human modeler’s role becomes crucial. The following are some common models that embody different assumptions about the nature of a problem.

Decision Trees are like a flowchart for making decisions. A Decision Tree works by asking a series of simple questions about the data, such as “Is the customer over 30?”. Each answer leads to another question or a final decision, effectively narrowing down the possibilities until a prediction is made. They are generally easy to understand and visualize.

Logistic Regression is a simple yet powerful model often used for classification. It finds the “best line” (or a more complex boundary in higher dimensions) to separate different categories in data, helping to predict the probability that an input belongs to a certain class.

Neural Networks, inspired by the human brain, consist of layers of interconnected “neurons,” which are simple processing units. Each neuron takes inputs, performs a calculation, and passes an output to the next layer. By adjusting the strength of these connections (weights) during training, neural networks learn to recognize complex patterns in data, ranging from images to text.

Bayesian Methods use probability theory to update their beliefs about data as new evidence comes in. These models allow AI to reason with uncertainty, much like how we update our own understanding based on new information. Naive Bayes, for example, calculates the probability of a data point belonging to a class based on the probabilities of its features.

Kernel Machines, such as Support Vector Machines (SVMs), offer a clever way to handle data that isn’t easily separable. Imagine you have two groups of dots that are all mixed up on a flat surface, making it impossible to draw a straight line to separate them. Kernel machines can “lift” these dots into a higher dimension—a conceptual space—where they become easily separable by a simple line or plane. Then, they project that separation back to the original space. This “kernel trick” allows them to find complex, non-linear boundaries between categories.
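
To close this tour, here is a sketch that fits one representative of each family above to the same synthetic dataset, assuming scikit-learn is available. The resulting scores are illustrative, not a benchmark, and say nothing about which model is “best” in general.

    # Several model families on one synthetic dataset (illustrative sketch, not a benchmark).
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    models = {
        "Decision tree": DecisionTreeClassifier(random_state=0),
        "Logistic regression": LogisticRegression(max_iter=1000),
        "Neural network": MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
        "Naive Bayes": GaussianNB(),
        "Kernel machine (SVM, RBF kernel)": SVC(kernel="rbf"),
    }
    for name, model in models.items():
        accuracy = model.fit(X_train, y_train).score(X_test, y_test)
        print(f"{name}: test accuracy {accuracy:.3f}")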

Conclusion: The Bitter Lesson and the Path Forward

As Machine Learning evolved, a profound observation emerged, articulated by AI researcher Rich Sutton as “The Bitter Lesson.” This lesson suggests that over the long haul, general methods that really lean into computation (like just making neural networks bigger and feeding them more data and processing power) tend to be more effective and robust than trying to build in human knowledge or super detailed, hand-crafted features.

In essence, the Bitter Lesson indicates that simply scaling up general learning algorithms with more data and computing power often yields better results than meticulously designing systems based on human intuition or explicitly coding domain-specific rules. This insight pushed the field further towards data-driven, statistical approaches, often achieving performance at scales beyond human intuitive grasp.

From a techno-pragmatist perspective, this lesson is not a dismissal of human understanding; rather, it’s a realization that human knowledge should be integrated by creating architectures and environments where algorithms can discover patterns and rules for themselves, emphasizing the ongoing human role in designing these powerful learning systems.

Machine Learning has undeniably revolutionized AI by enabling systems to learn from vast amounts of data, adapt to new situations, and solve problems that were intractable for purely rule-based systems. It has transformed fields from computer vision to natural language processing, bringing AI out of the research labs and into everyday applications.

However, while powerful, these methods also have their own set of challenges and limitations. As we’ve hinted, issues like learning unintended biases from data, the need for vast datasets, and the black box nature of some complex models are ongoing areas of research and concern. These challenges, along with the ethical implications of increasingly intelligent systems, will be explored in subsequent chapters of this book.

The journey of AI is one of continuous discovery and adaptation, driven by both technological advancements and human choices, as we strive to build intelligent tools that are not just capable, but also responsible and beneficial for all.