5 Generative AI
The machine from the last chapter — tokens in, probability distribution out, sample, append, repeat — is not only a language machine. It is a generative machine that happens to have been pointed at language. Point it at something else and the cycle still runs. What changes is the alphabet.
By the end of this chapter you will be able to take that one idea — learn a distribution, sample from it — through pixels, waveforms, video, and the strange in-between case where a single model handles all of them at once. Each modality starts as its own engineering problem with its own architecture, and the field spends roughly a decade building bespoke machines for each. Then, around 2023, the bespoke machines collapse into one, and the distinction between a language model and a generative model stops making sense.
Leaving language
There is exactly one idea that survives leaving text. Look at a pile of data — sentences, images, recordings, frames of video, whatever — and learn the probability distribution that produced it. Then sample from that distribution to make new examples. That is what generative means, mechanically. Language modelling is one special case of it — the one where the data happens to be a sequence of tokens.
The shapes of the data are different, and the different shapes are most of why the rest of the chapter exists. Language is a one-dimensional sequence of tokens. An image is a two-dimensional grid of pixels, each of which carries three colour values, which is a much larger object than it sounds: a one-megapixel photograph is three million numbers. Audio is also one-dimensional, a waveform sampled forty-four thousand times a second, but the structure that matters to a human ear lives at timescales the raw samples do not reveal. Video is the worst of all worlds. It is spatial like an image and temporal like audio, simultaneously, and a few seconds of it at high resolution weighs in at hundreds of millions to billions of numbers.
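The sizes quoted above are easy to check by hand. The sketch below does the arithmetic; the resolutions, frame rate, and clip lengths are illustrative assumptions chosen for round numbers, not figures from any particular system.

```python
# Rough element counts per modality. Resolutions, frame rate, and clip
# lengths are illustrative assumptions, not figures from the chapter.

def image_numbers(width: int, height: int, channels: int = 3) -> int:
    """Numbers needed to store one uncompressed image."""
    return width * height * channels

def audio_numbers(seconds: float, sample_rate: int = 44_100) -> int:
    """Numbers needed for one mono waveform at the given sample rate."""
    return int(seconds * sample_rate)

def video_numbers(width: int, height: int, fps: int, seconds: int,
                  channels: int = 3) -> int:
    """Numbers needed for an uncompressed clip: one image per frame."""
    return image_numbers(width, height, channels) * fps * seconds

one_megapixel = image_numbers(1000, 1000)             # 3_000_000
ten_seconds_audio = audio_numbers(10)                 # 441_000
five_seconds_hd = video_numbers(1920, 1080, 30, 5)    # 933_120_000
five_seconds_4k = video_numbers(3840, 2160, 30, 5)    # 3_732_480_000
```

Five seconds of assumed 1080p footage is already close to a billion numbers, and the same clip at 4K is several billion, which is why video sits at the far end of the scale.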
A generative model for any of these is a function that asks the same question chapter 4 asked: what plausibly comes next, given what is already there? In language the answer is the next token. In an image, “next” can mean the next pixel, or the next patch, or — as we will see — the next step of removing noise from a blurry version of the whole picture. In audio it can mean the next waveform sample, or the next codec token. In video it can mean the next frame, or the next patch in time and space. The answer shape changes with the modality. The shape of the question does not.
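The claim that the question keeps its shape while the alphabet changes can be made concrete with the smallest possible generative model: count which symbol follows which, then sample. This is a toy sketch, not how any real system is built; the function names and the two tiny datasets are invented for illustration. The same code runs on characters and on pixel intensities without knowing the difference.

```python
import random
from collections import Counter, defaultdict

def fit_next_symbol_model(sequence):
    """Count how often each symbol follows each other symbol."""
    counts = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        counts[current][following] += 1
    return counts

def sample_next(counts, current, rng):
    """Draw the next symbol in proportion to how often it followed `current`."""
    options = counts[current]
    symbols = list(options)
    weights = [options[s] for s in symbols]
    return rng.choices(symbols, weights=weights, k=1)[0]

def generate(counts, start, length, rng):
    """Sample, append, repeat."""
    out = [start]
    for _ in range(length - 1):
        out.append(sample_next(counts, out[-1], rng))
    return out

rng = random.Random(0)

# Alphabet one: characters. Alphabet two: pixel intensities.
# Both toy sequences are chosen so every symbol has at least one successor.
text_model = fit_next_symbol_model(list("abacabadabac"))
pixel_model = fit_next_symbol_model([0, 0, 255, 255, 0, 0, 255, 255, 0])

text_sample = generate(text_model, "a", 8, rng)
pixel_sample = generate(pixel_model, 0, 8, rng)
```

A real model replaces the lookup table with a neural network and conditions on far more than one previous symbol, but the loop in `generate` is the cycle the chapter keeps returning to.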
What follows takes each modality on its own terms, then folds them together. Images first, since they got there first and the lessons from them anchor everything else.
GANs
In June 2014, a graduate student in Montreal named Ian Goodfellow proposed a training procedure so simple it sounds almost like a joke.1 Train two neural networks against each other. One — the generator — takes a random vector of noise and tries to turn it into an image. The other — the discriminator — takes an image, real or fake, and tries to decide which it is. The two networks play a game. The generator wants to fool the discriminator. The discriminator wants to catch the generator. They train together. Each one’s loss is the other one’s win.
The cleverness, once you see it, is what the game avoids. There is no need to specify what makes an image good. There is no hand-written cost function for plausibility or realism — words that would have been impossible to encode in code in 2014. The discriminator learns what real looks like from data, and the generator learns what real looks like from the discriminator. Both networks are the deep-learning machine chapter 3 built; nothing new in the parts, everything new in the arrangement.
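The game can be written down as two loss functions over the discriminator's outputs. The sketch below assumes the discriminator emits a probability that an image is real, and it uses the non-saturating generator loss that the original paper suggests for practical training, rather than the strict zero-sum form. The probability values are made up to show the two ends of training.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real images should score 1, generated ones 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating loss: the generator is rewarded when fakes score as real."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Made-up discriminator outputs (probability that each image is real).
early = {"d_real": [0.9, 0.8, 0.95], "d_fake": [0.1, 0.2, 0.05]}  # generator losing
later = {"d_real": [0.5, 0.5, 0.5], "d_fake": [0.5, 0.5, 0.5]}    # discriminator guessing

early_d, early_g = discriminator_loss(**early), generator_loss(early["d_fake"])
later_d, later_g = discriminator_loss(**later), generator_loss(later["d_fake"])
```

As the generator improves, its loss falls and the discriminator's rises toward 2·ln 2, the value it takes when it can do no better than a coin flip. Nothing in either function says what a good image looks like.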
Two ideas come out of GANs that survive the entire chapter, so it is worth slowing down on them.
The first is the latent space. The generator’s input is a vector of, say, a hundred random numbers. Different vectors produce different images. Train the network well and the vector space organises itself: nearby vectors produce visually similar images, and you can walk a smooth path from one vector to another and watch the corresponding images morph into each other coherently. Faces grow older, lighting shifts, expressions soften, all by walking a straight line through a hundred-dimensional space. The latent vector is not the image; it is a coordinate on a compressed map of the space of plausible images. That map is the deep prize of the technique. Every generative model in this chapter has one.
The second is the trick of training without a hand-written objective. Adversarial training was the first crisp demonstration that you can let two networks negotiate a quality signal between them, instead of writing one down. The technique generalises far beyond images, and a chunk of modern model training — including the preference-optimisation step in chapter 4 — descends from this insight more than it descends from any particular GAN paper.
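The straight-line walk through latent space described above is a few lines of arithmetic. This is a minimal sketch: the hundred-dimensional vectors are random stand-ins, and `generator` in the final comment is a hypothetical trained network, not something defined here.

```python
import random

def lerp(z_start, z_end, t):
    """Point a fraction t of the way along the straight line from z_start to z_end."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)]

def latent_walk(z_start, z_end, steps):
    """Evenly spaced latent vectors from z_start to z_end, endpoints included."""
    return [lerp(z_start, z_end, i / (steps - 1)) for i in range(steps)]

rng = random.Random(0)
LATENT_DIM = 100  # "say, a hundred random numbers"

z_a = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
z_b = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
path = latent_walk(z_a, z_b, steps=8)

# With a trained generator (hypothetical callable) each point becomes a frame:
# frames = [generator(z) for z in path]
```

The code only moves the coordinate. Whether the rendered frames morph smoothly depends entirely on how well training has organised the map.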
By 2018 the technique had matured into StyleGAN, the system that gave the world thispersondoesnotexist.2 A user reloads a webpage and is shown a face. The face is not a person. The face has never been a person. The pores are right, the catch-lights in the eyes are right, the wisps of hair against the forehead are right, and the face does not exist. The cultural moment landed harder than any technical paper of the era. For roughly two years, what a GAN looks like was a question one no longer had to teach; you could just open a browser tab.
GANs had a failure mode that ultimately broke them, though. Mode collapse. The generator, in trying to fool the discriminator, sometimes discovers one image — or a small family of images — that the discriminator cannot reliably catch, and then it produces that one over and over. Training stops exploring. The latent space, instead of spreading evenly over the data distribution, bunches into a few attractors and leaves vast regions of plausibility empty. There are tricks to mitigate it. None of them are clean. By the time the field found a generative technique that did not have this problem, GANs were already being displaced.
The lineage matters more than the throne. GANs lost the leading-edge race around 2021, but the latent-space picture they bequeathed is the idea everything after them builds on. Hold on to it. Cleaner ways to populate a useful latent space and walk around inside it are what the rest of the modalities are reaching for.
Diffusion
The idea that replaced GANs sounds, on first hearing, even more like a joke. Take a real photograph. Add a little visual static to it. Add a little more. Repeat the operation a thousand times. By the end, what was a photograph is now pure television-snow, indistinguishable from a random noise pattern. The reverse direction, going from snow back to a photograph in one step, is plainly impossible. The trick is to do it in a thousand small steps. Train a network on the question: given an image that is slightly noisier than it should be, what was the previous, slightly cleaner version? Each step is small enough that the network can learn it. To generate a new image, you start from snow and apply the trained network over and over, running the noising process in reverse and peeling off one layer of noise at a time, until something emerges that was never there.
The technical name is a denoising diffusion probabilistic model, and the canonical paper is from Berkeley in 2020.3 The intuition is the part that matters. The network never learns “how to draw”; it only learns “how to remove a little bit of noise.” Composition does the rest. A thousand small denoisings, run in sequence on a starting point that is pure noise, produce an image.
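The noising half of the procedure needs no network at all, and writing it out shows why a thousand steps end in snow. The sketch below uses a linear noise schedule with the endpoints the DDPM paper reports as its defaults; the four-number "image" is a stand-in. One assumption worth flagging: the paper trains the network to predict the noise that was added, which carries the same information as predicting the cleaner image, and that is the loss shown here.

```python
import math
import random

T = 1000
# Linear noise schedule; endpoints follow the DDPM paper's reported defaults.
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# alpha_bar[t]: fraction of the original signal's variance left after t + 1 steps.
alpha_bar = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)

def add_noise(x0, t, rng):
    """Jump straight to step t:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    scale = math.sqrt(alpha_bar[t])
    sigma = math.sqrt(1.0 - alpha_bar[t])
    x_t = [scale * x + sigma * e for x, e in zip(x0, eps)]
    return x_t, eps

def training_loss(predicted_eps, true_eps):
    """Mean squared error between the network's noise guess and the true noise."""
    return sum((p - e) ** 2 for p, e in zip(predicted_eps, true_eps)) / len(true_eps)

rng = random.Random(0)
image = [0.5, -0.25, 1.0, 0.0]  # a four-"pixel" stand-in for a photograph
slightly_noisy, _ = add_noise(image, 0, rng)
pure_snow, eps = add_noise(image, T - 1, rng)
```

After the first step almost all of the signal survives; after the last, `alpha_bar[-1]` is a few parts in a hundred thousand, so what remains is noise for all practical purposes. The "clean loss" is the plain squared error in `training_loss`.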
Diffusion beat GANs on every axis of quality that mattered. Training was stable: no adversarial knife-fight, no mode collapse, just a regression problem with a clean loss. The samples were higher quality and more diverse, because the noisy starting points covered the latent space evenly rather than getting trapped near the discriminator’s blind spots. The same architecture could be conditioned on text, on layout, on a sketch, on another image: anything that could be encoded into a vector could be poured in as an extra input and steer the denoising.
The catch was cost. Doing the whole process at the full resolution of a photograph was punishing. Each of the thousand denoising steps had to run a full deep network over millions of pixels, and at the scale of a megapixel image this added up to hours on a single GPU. For a couple of years diffusion was beautiful and inaccessible.
The unlock came in late 2021, from a group working across Heidelberg and Munich. Their paper is called High-Resolution Image Synthesis with Latent Diffusion Models, and the model built on it became famous under a shorter name: Stable Diffusion.4 The idea is to do the whole noisy dance not in the space of raw pixels but in a compressed latent space, the same kind of compressed map GANs had been using, learned by a small encoder–decoder network and frozen in advance. Run the thousand denoising steps over a 64-by-64 grid of latent codes instead of a 512-by-512 grid of pixels, then decode the result back into a picture at the end. The arithmetic shrinks by well over an order of magnitude. The image quality does not. Suddenly the same machine ran on one consumer GPU, in seconds rather than hours, and the weights were small enough to download.
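The saving from moving into latent space is simple arithmetic. The figures below are assumptions taken from the shapes commonly cited for the first public Stable Diffusion release (a 512-by-512 RGB image, a 64-by-64 latent with four channels); other configurations give different ratios.

```python
def element_count(height: int, width: int, channels: int) -> int:
    """How many numbers one grid of the given shape holds."""
    return height * width * channels

# Assumed shapes: 512x512 RGB pixels versus a 64x64 latent with 4 channels.
pixel_elements = element_count(512, 512, 3)      # 786_432
latent_elements = element_count(64, 64, 4)       # 16_384
reduction = pixel_elements / latent_elements     # 48.0

# Attention compares every spatial position with every other one, so its
# cost grows with the square of the number of positions.
pixel_positions = 512 * 512
latent_positions = 64 * 64
attention_pair_ratio = (pixel_positions / latent_positions) ** 2   # 4096.0
```

The denoiser handles about forty-eight times fewer numbers per step, and any layer whose cost grows with the square of the number of positions gets cheaper by a factor in the thousands.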
Text conditioning was the second piece, and the mechanism is worth naming precisely even though we will not fully unpack it until the next-but-one section. A text prompt is passed through a text encoder — the encoder half of a system called CLIP, which is the hinge of this chapter and which §5 takes apart properly — and the resulting vector is fed into the denoising network at every step, via the same attention mechanism chapter 4 explained. The picture being denoised is steered, at every step, by what the text vector tells the network to look for. Type a cat in a top hat, and the noisy patches start to organise themselves into a face with whiskers under a tall black cylinder, because at every step the denoiser is being pulled toward the part of latent space the text vector points at.
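The steering mechanism is cross-attention: each position in the noisy latent asks the text tokens which of them is relevant, and mixes in their content accordingly. The sketch below is a bare, single-head version with hand-made two-dimensional vectors. In a real denoiser the queries, keys, and values are produced by learned projections, which are omitted here.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross_attention(image_queries, text_keys, text_values):
    """Each image position takes a weighted average of the text values,
    weighted by how well its query matches each text key."""
    d = len(text_keys[0])
    outputs, all_weights = [], []
    for q in image_queries:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in text_keys])
        mixed = [sum(w * v[j] for w, v in zip(weights, text_values))
                 for j in range(len(text_values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy vectors: two text tokens ("cat", "hat") and two latent positions.
text_keys = [[4.0, 0.0], [0.0, 4.0]]
text_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
image_queries = [[3.0, 0.0], [0.0, 3.0]]

outputs, weights = cross_attention(image_queries, text_keys, text_values)
```

The first latent position ends up attending almost entirely to the first token and the second position to the second, which is the sense in which the text pulls different regions of the picture in different directions.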
In 2022 this combination (latent diffusion, text conditioning, weights you could download) produced the public phenomenon the intro called the generative turn. The intro told you the names: DALL·E 2 in April, Midjourney in July, Stable Diffusion in August.5,6 Within ninety days, anybody with a credit card or a graphics card could type a sentence and get a picture. The pictures were, by any historical standard, miraculous. They were also, occasionally, cursed. The cursed-hands era is the artifact to remember. For about a year, image models could draw almost anything except the human hand, which would emerge with six fingers, with thumbs in the wrong places, with knuckles bending backward, with whole fingers fused into asymmetric bone-coloured clubs. The internet was full of pictures of beautiful imaginary people with terrible hands.
The cursed-hands era ended quickly. Better training data, better captions, careful augmentation — by 2024 the major systems had largely solved the problem the same way they had solved every previous artifact, which is by throwing more data and more compute at it. The pattern is worth naming, because we will see it again. The artifacts of a generative model are not eternal bugs. They are the parts of the distribution the training data underrepresented or mislabelled, and they shrink as the data improves.
Video belongs in this section as one paragraph of honesty. By 2024–25, systems like Sora, Veo, and Runway had pushed diffusion into the time dimension, generating short clips by denoising a stack of frames jointly so that objects persist across them. The samples are striking, sometimes beautiful, often physically wrong in ways that are hard to articulate until you watch them twice. A glass falls and the shadow ignores the fall. A leg crosses behind another leg and the foot reappears on the wrong side. Temporal coherence is the hard part, and as of 2026 it is not yet solved. §7 will return to this briefly. For now, the takeaway is that diffusion is the substrate underneath video too, and the same mechanism that makes the still images astonishing is what makes the moving ones almost-but-not-quite right.
Whisper and audio generation
Audio cuts in two directions. Listening — converting a recording of speech into text — and speaking — converting text into a recording of speech. There is a third, composing, which is making music from scratch, and we will touch it at the end. The fundamental observation is that all three are the same trick in different coats. Once you can turn sound into a sequence the transformer from chapter 4 understands, the rest of the apparatus follows.
Start with listening. The best-known system as of 2026 is Whisper, released by OpenAI in 2022.7 The mechanism is two ideas stacked. First, take a raw audio waveform and convert it into a spectrogram: a two-dimensional image where one axis is time, the other axis is frequency, and the brightness at each point is how much energy the recording carried at that frequency at that time. A spectrogram is, mechanically, a picture of sound. Second, treat it as a sequence-to-sequence problem in exactly the chapter-4 sense: the spectrogram patches are the input tokens, the text transcript is the output, and a transformer between them learns the mapping. That is the entire system, in one line.
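A spectrogram can be computed from first principles in a few lines: slice the waveform into short overlapping frames and measure how much of each frequency every frame contains. The sketch below uses a deliberately naive transform and a low assumed sample rate so that it runs quickly; it stops at the plain magnitude, whereas Whisper's own front end applies a Mel frequency scale and a logarithm on top.

```python
import cmath
import math

def spectrogram(samples, frame_size, hop):
    """Magnitude spectrogram: rows are time frames, columns are frequency bins."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # A Hann window fades the frame edges to reduce spectral leakage.
        windowed = [s * 0.5 * (1 - math.cos(2 * math.pi * n / (frame_size - 1)))
                    for n, s in enumerate(frame)]
        bins = []
        for k in range(frame_size // 2 + 1):  # non-negative frequencies only
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n, x in enumerate(windowed))
            bins.append(abs(acc))
        frames.append(bins)
    return frames

SAMPLE_RATE = 8000   # assumed; kept low so the naive transform stays fast
FRAME_SIZE = 64
TONE_HZ = 1000       # a pure test tone standing in for a recording

signal = [math.sin(2 * math.pi * TONE_HZ * n / SAMPLE_RATE) for n in range(512)]
spec = spectrogram(signal, FRAME_SIZE, hop=32)

loudest_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
loudest_hz = loudest_bin * SAMPLE_RATE / FRAME_SIZE
```

For the test tone the brightest column sits at 1000 Hz in every frame, a single horizontal line in the picture. Speech produces a far busier picture, and that picture is what the transformer reads.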
The clever part is the training data. Whisper was trained on six hundred and eighty thousand hours of audio scraped from the public web, paired with whatever subtitles or transcripts accompanied them. About two-thirds of it was English; the remainder was speech in dozens of other languages; and all of it was a mess of accents, background music, cross-talk, varying recording quality, and channel noise. The model trained on the whole pile at once, and it emerged generalising in a way previous speech-recognition systems did not. You can hand Whisper a poorly-recorded voice memo in Caribbean Spanish with a fan running in the background, and it transcribes it. The voice notes that scoped the outline this chapter is built from were transcribed by exactly this machine.
Now speaking. Text-to-speech is the inverse problem, and the modern recipe runs in three steps. Text in. A network converts the text into linguistic features — the phonemes that will be spoken, the rhythm, the pitch contour, the emotional inflection. A second network, the vocoder, converts those features into a waveform. The lineage runs from DeepMind’s WaveNet in 2016, which was the first network to generate raw audio sample-by-sample at sixteen thousand samples per second and proved the technique was possible, through a generation of neural codecs that compressed audio into tokens, into today’s high-fidelity commercial systems like ElevenLabs and OpenAI’s voice mode.8 The state of the art in 2026 is good enough that a clean, well-paced sample of synthesised speech can be mistaken for a recording of a person, by anyone not listening for tells.
That capability is also where the chapter has to be sharp. Voice cloning — building a model that can speak in a specific person’s voice from a brief sample of that voice — used to require an hour of clean audio recorded by the target person in a studio. It now requires a few seconds. The system listens to a voicemail, extracts the speaker’s vocal characteristics into an embedding, and from then on can read any text in that voice. The technology is astonishing. It is also, in its current commercial deployment, grotesque. Phone scams targeting elderly parents with the cloned voice of a grandchild in distress are a real category of crime in 2026, and the technical floor for committing one has been lowered to the price of a monthly subscription. The book holds both observations together, which is what the techno-pragmatist register is for. Part III takes the ethics apart properly; here the job is to name the capability with the right sharpness and move on.
Music, briefly. Systems like Suno and Udio in 2024–25 generate songs from text prompts: full arrangements with vocals, in stylistic genres of the user’s choosing. The architecture bears a family resemblance to the rest of this section: a neural codec tokenises audio into discrete pieces, and a transformer learns to predict the next piece. The hard part is what no other modality has to solve at quite this scale: long-range coherence over minutes. A song has to remember its key, its verse–chorus structure, its melodic motifs across a three-minute span. As of 2026 the systems mostly do, sometimes do not, and the failure modes are interesting in ways that go beyond this chapter.
The bridge sentence to the rest of the chapter is the one to remember. Once audio is tokenised — by Whisper-style spectrogram patches on the input side, by neural-codec tokens on the output side — generating it becomes next-token prediction again. The transformer from chapter 4 does not have to know it is working on sound rather than text. It just sees tokens.
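What tokenising audio with a neural codec means, at its core, is vector quantisation: replace each short frame of sound features with the index of the nearest entry in a codebook. The sketch below shows only that lookup step, with a hand-made four-entry codebook and invented two-dimensional frames. A real codec learns both the features and the codebook from data, and its codebook is far larger.

```python
def nearest_code(frame, codebook):
    """Index of the codebook vector closest to `frame` (squared Euclidean distance)."""
    def sq_dist(code):
        return sum((f - c) ** 2 for f, c in zip(frame, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))

def tokenise(frames, codebook):
    """Continuous frames in, discrete token ids out."""
    return [nearest_code(f, codebook) for f in frames]

def detokenise(tokens, codebook):
    """Token ids back to (approximate) frames, ready for a decoder."""
    return [codebook[t] for t in tokens]

# Hand-made codebook and frames, purely for illustration.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_frames = [[0.1, -0.1], [0.9, 0.2], [0.8, 0.9], [0.1, 0.95]]

tokens = tokenise(audio_frames, codebook)        # [0, 1, 3, 2]
reconstruction = detokenise(tokens, codebook)
```

Once the sound is a list of integers like `tokens`, a transformer can predict the next one exactly as it would predict the next word.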
CLIP and the multimodal bridge
Here is the question that organises the rest of the chapter. How do you teach a machine that the word dog and a photograph of a dog are about the same thing?
For most of deep learning’s history, the answer was you don’t. Vision models were trained on labelled images, where the label was an integer from a fixed list of, say, a thousand category names. Language models were trained on text, where the words were tokens drawn from their own vocabulary. The two systems lived in incompatible spaces. A vision model produced a vector that meant “this image is in category 487”; a language model produced a vector that meant “the next token is probably retriever”. Those two vectors had no shared coordinate system. They lived in different countries with different currencies.
In January 2021, a paper from OpenAI changed that. The paper is called Learning Transferable Visual Models From Natural Language Supervision, and it introduced the system the field calls CLIP.9 The idea, told as plainly as it can be told, is this. Train two networks at the same time on a giant dataset of (image, caption) pairs scraped from the public web — four hundred million of them, in CLIP’s case. One network reads images. The other reads text. Both produce vectors of the same dimensionality. The training objective: for each (image, caption) pair, push the two vectors together; for each mismatched pair from the same batch, push them apart. That is contrastive training. The image of a golden retriever and the caption a photo of a golden retriever end up landing near each other in a shared vector space. The image of a golden retriever and the caption a slice of pizza end up far apart.
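The push-together, push-apart objective can be written as a cross-entropy over the batch's similarity matrix, applied once along the rows and once along the columns. The sketch below follows that shape with toy two-dimensional embeddings. The temperature of 0.07 matches the initial value the CLIP paper reports; everything else, including the function names, is illustrative.

```python
import math

def normalise(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target_index):
    """Negative log-probability that softmax(logits) assigns to the target."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target_index]

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric cross-entropy over the batch similarity matrix.
    Row i of image_vecs is assumed to be paired with row i of text_vecs."""
    images = [normalise(v) for v in image_vecs]
    texts = [normalise(v) for v in text_vecs]
    n = len(images)
    sim = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
           for img in images]
    image_to_text = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    text_to_image = sum(cross_entropy_row([sim[j][i] for j in range(n)], i)
                        for i in range(n)) / n
    return (image_to_text + text_to_image) / 2

# Toy embeddings: index 0 is the retriever pair, index 1 the pizza pair.
aligned_images = [[1.0, 0.1], [0.1, 1.0]]
aligned_texts = [[0.9, 0.0], [0.0, 0.8]]
swapped_texts = [aligned_texts[1], aligned_texts[0]]

loss_aligned = contrastive_loss(aligned_images, aligned_texts)
loss_swapped = contrastive_loss(aligned_images, swapped_texts)
```

When each image sits next to its own caption the loss is close to zero; swap the captions and it jumps by orders of magnitude. Training is the process of moving two encoders from the second situation to the first.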
What that joint training buys is the most consequential single idea of the multimodal era. After it, the image encoder and the text encoder share a coordinate system. A picture and a sentence can now be compared directly by computing the distance between their vectors. There is no longer any need for a fixed label set or a translation step between the two modalities. Meaning, for both pictures and words, is now expressed in the same currency.
The implications fall out one after another. Zero-shot classification: given an image of an unfamiliar object, embed it, then embed each of a list of candidate descriptions — a photo of a Border Collie, a photo of a Bernese Mountain Dog, a photo of a Welsh Corgi — and pick the one closest to the image vector. No retraining. No fine-tuning. No labelled examples of the specific breed. CLIP can do the job because the joint space already encodes the relationship. Cross-modal search: you have a database of a million photographs and you type children playing on a beach at sunset; you embed the sentence, you compare its vector to every image vector, you return the closest few. The same trick. Image generation conditioning: this is the one §3 promised. The CLIP text encoder is exactly the steering wheel a diffusion model uses to point the denoising process at the right region of image space. A cat in a top hat is, mechanically, a coordinate in CLIP space, and the diffusion network’s job at every step is to nudge the noisy image toward that coordinate.
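Zero-shot classification and cross-modal search both reduce to one operation in the shared space: find the nearest vector. The sketch below uses made-up three-dimensional embeddings in place of real encoder outputs, so the numbers carry no meaning beyond the illustration. Only the comparison logic is the point.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_vec, label_vecs):
    """Return the label whose text embedding is closest to the image embedding."""
    return max(label_vecs, key=lambda label: cosine(image_vec, label_vecs[label]))

def search(query_vec, image_index, top_k=2):
    """Rank stored image embeddings by similarity to a text query embedding."""
    ranked = sorted(image_index,
                    key=lambda name: cosine(query_vec, image_index[name]),
                    reverse=True)
    return ranked[:top_k]

# Made-up embeddings standing in for text-encoder outputs.
label_vecs = {
    "a photo of a Border Collie": [0.9, 0.1, 0.1],
    "a photo of a Bernese Mountain Dog": [0.1, 0.9, 0.1],
    "a photo of a Welsh Corgi": [0.1, 0.1, 0.9],
}
mystery_image = [0.2, 0.15, 0.95]   # stand-in for an image-encoder output
predicted = zero_shot_classify(mystery_image, label_vecs)

image_index = {
    "beach.jpg": [0.8, 0.5, 0.1],
    "office.jpg": [0.1, 0.2, 0.9],
    "sunset.jpg": [0.7, 0.7, 0.0],
}
query = [0.8, 0.6, 0.0]   # stand-in for "children playing on a beach at sunset"
hits = search(query, image_index)
```

Adding a new breed to the classifier means adding one more caption to `label_vecs`; no weights change. That is what "no retraining" means in practice.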
The trick generalises beyond pictures and text. CLAP extends it to audio: train an audio encoder and a text encoder jointly, and the joint space lets you search for sounds with text queries or vice versa.10 MuLan and its descendants do the same for music. The pattern is the same pattern every time. Pick two modalities you want to relate. Find a giant dataset of paired examples. Train two encoders jointly with a contrastive objective. The shared space falls out.
That generality is what makes CLIP the hinge of the chapter. If every modality can be mapped, by the same recipe, into one shared coordinate space, then the next move is obvious. Stop training modalities separately and then stitching them together. Train one model on all of them at once, from the start, with all the modalities flowing through the same machine. The next section is what that looks like.
Native multimodal models
For most of the period this chapter covers, multimodal AI was a pipeline. You took an image encoder — frequently a CLIP-trained one — and bolted it to the front of a language model with a small adapter network in between, training the adapter to project image vectors into something the language model could read as if they were tokens. The result, called a vision-language model, could look at a picture and answer questions about it. It worked. It was also, architecturally, two systems duct-taped together. The image encoder and the language model had been trained separately, in different runs, with different objectives, and the adapter was the seam.
The next move, which became dominant around 2024, was to dissolve the seam. Train a single model from the start on an interleaved token stream containing all modalities. The recipe has three pieces. First, a tokeniser per modality: byte-pair encoding for text, a patch tokeniser that chops images into small squares and embeds them, a neural codec for audio. Second, a single very large transformer trained on streams in which all those token types appear together, in any order, the way a human conversation might switch from speech to a referenced photograph to a written caption without changing register. Third, decoding heads per modality on the output side, so the same model can emit text tokens, image patches, or audio codec tokens depending on what the task wants. The transformer itself is the chapter-4 machine, unchanged. It does not know which kind of token it is reading or writing. It only knows the next-token-prediction objective.
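One simple way to picture the interleaved stream is to give every modality its own block of ids inside a single shared vocabulary, so the transformer sees one flat sequence of integers. The sketch below assumes every modality has already been turned into discrete ids, and the vocabulary sizes are hypothetical. Real systems differ in the details, and some feed image patches in as continuous embeddings instead.

```python
# Hypothetical vocabulary sizes; each modality gets its own block of ids.
VOCAB_SIZES = {"text": 50_000, "image": 8_192, "audio": 1_024}

OFFSETS = {}
_next_free = 0
for _modality, _size in VOCAB_SIZES.items():
    OFFSETS[_modality] = _next_free
    _next_free += _size
TOTAL_VOCAB = _next_free

def to_shared_ids(modality, local_ids):
    """Shift a tokeniser's local ids into its block of the shared vocabulary."""
    size = VOCAB_SIZES[modality]
    if any(not 0 <= i < size for i in local_ids):
        raise ValueError(f"id out of range for {modality}")
    return [OFFSETS[modality] + i for i in local_ids]

def modality_of(shared_id):
    """Recover which decoding head a shared id belongs to."""
    for modality, offset in OFFSETS.items():
        if offset <= shared_id < offset + VOCAB_SIZES[modality]:
            return modality
    raise ValueError("id outside the shared vocabulary")

# One interleaved example: a caption, then image patches, then speech.
segments = [("text", [17, 942]), ("image", [5, 5, 8191]), ("audio", [0, 1023])]
stream = [tid for modality, ids in segments
          for tid in to_shared_ids(modality, ids)]
heads = [modality_of(tid) for tid in stream]
```

The transformer trains on `stream` as if it were any other token sequence. Only at the output does `modality_of` matter, to route each predicted id to the right decoder.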
The systems that exemplify this in 2026 are easy to name. GPT-4o — o for omni — was OpenAI’s 2024 release that handled native audio in and audio out, alongside text and images, and as a side effect produced sub-second voice replies because the audio was being generated by the same machine that was thinking the answer. Gemini, Google DeepMind’s frontier line, was multimodal from its first version, with native handling of text, images, audio, and video in a single architecture. Claude, Anthropic’s family, was vision-native from version 3 and added audio input by 2025. Treat these as examples-as-of-2026 rather than canon: the names will date faster than the mechanism. The mechanism is the part to keep, and the mechanism is what the previous paragraph described.
Here is what the convergence implies. By 2026 the distinction between a language model and a generative model for other modalities has largely stopped mattering. A frontier model is one model, trained from scratch on tokens of every shape, that can read and write any of them on demand. The text machine of the last chapter and the image and audio machines of this one were the same machine all along, looking at different alphabets. The only thing that changed was what the alphabet was. The transformer that emits the next token does not care.
What’s still hard
Three honest caveats before the chapter closes.
Long-horizon video remains unsolved as of 2026. The frame-by-frame snapshots are increasingly photoreal. The physics across frames drifts. An object that was on the table moves and reappears under it; a shadow ignores its source; a person turns their head and a different person turns back. The problem is not architectural cleverness. The problem is that the model has no internal world model strong enough to enforce that objects persist with consistent properties across time. The bigger the time window, the worse it gets, and there is no clear point in sight at which simply scaling further will solve it. Whether scaling will solve it anyway is one of the open empirical questions of the next few years.
Multi-speaker audio is the second honest weak spot. Modern systems handle one voice in clean conditions beautifully. Two people overlapping in a noisy café still confounds them. The hard part is not perception of any individual voice; it is separation: deciding which sounds belong to which speaker, in real time, when the streams overlap. Humans do this constantly. The systems we have are still distinctly worse at it than a six-year-old in a crowded room.
The deepest open question is the one Part III has to take seriously, and it sits across every modality the chapter has covered. Does the model understand the image, or has it learned what captions tend to look like for images statistically near it? CLIP can pick out a golden retriever in a photograph. Whether the network has learned the concept dog, or whether it has learned the statistics of how photographs of dogs are described in web captions, is a question the model itself cannot answer for us. The first reading is what most people imagine when they hear the word understand. The second is much closer to what the training procedure actually optimises for. Part III’s chapters on alignment and limitations pick this up from the other side, where the same gap shows up as bias, as fragility under distribution shift, as confident wrongness under domain change. Here, the job is only to name it, and to hold the two facts together.
The convergence is real and astonishing. It is also not the same thing as understanding. Both have to be held at once. The next chapter wraps a loop around these machines and starts asking what they can be made to do in the world. The answer, with care, is a great deal.
1. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. Generative Adversarial Networks. NeurIPS, 2014. arXiv:1406.2661. The paper the intro pointed at when it said adversarial training arrived in Montreal in 2014.
2. Karras, T., Laine, S., & Aila, T. A Style-Based Generator Architecture for Generative Adversarial Networks. CVPR, 2019. arXiv:1812.04948. The architecture behind thispersondoesnotexist.
3. Ho, J., Jain, A., & Abbeel, P. Denoising Diffusion Probabilistic Models. NeurIPS, 2020. arXiv:2006.11239. The paper that established diffusion as a practical training recipe for high-quality image generation.
4. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. CVPR, 2022. arXiv:2112.10752. The latent-space trick that made diffusion run on a consumer GPU; released as Stable Diffusion later that year.
5. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv:2204.06125, 2022. The DALL·E 2 / unCLIP paper.
6. Saharia, C. et al. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. NeurIPS, 2022. arXiv:2205.11487. The Imagen paper from Google, which alongside DALL·E 2 and Stable Diffusion defined the 2022 generative turn.
7. Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. arXiv:2212.04356, 2022. The Whisper paper.
8. van den Oord, A. et al. WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499, 2016. The first neural network to generate raw audio waveforms sample-by-sample at fidelity high enough for production speech synthesis. The neural-codec lineage that runs through to today’s text-to-speech systems descends from it via VQ-VAE (van den Oord, Vinyals, & Kavukcuoglu, Neural Discrete Representation Learning, NeurIPS 2017, arXiv:1711.00937).
9. Radford, A. et al. Learning Transferable Visual Models From Natural Language Supervision. ICML, 2021. arXiv:2103.00020. The CLIP paper. The contrastive image–text training procedure that anchors the multimodal era.
10. The same recipe, audio in place of images: see for example CLAP: Learning Audio Concepts From Natural Language Supervision (Elizalde, Deshmukh, Al Ismail, & Wang, ICASSP 2023, arXiv:2206.04769) and Google’s MuLan: A Joint Embedding of Music Audio and Natural Language (Huang et al., ISMIR 2022, arXiv:2208.12415).