3 Deep Learning

You have probably heard the phrase neural network a hundred times by now and you still do not know what one is. That is fair. Almost nothing in the public discussion will tell you.

By the end of this chapter you will be able to draw the machine on a napkin, explain how it figures out which of its thousands of knobs to turn, say why “deep” turns out to matter more than “clever,” and see the exact shape of the architectural problem that someone was about to solve in 2017.

This is the substrate everything else runs on. Language models are this machine, reshaped for words. Image and audio generators are this machine, reshaped for pixels and waveforms. Once you can see it clearly, you have most of the picture.

The neuron as a function

I want to start by killing the biology. The word neuron is the most misleading borrowing in the field. There is no electrochemistry here, no dendrites, no axons firing in synchrony. There is a tiny function, and that is all.

Picture a small box. It has a handful of inputs coming in — numbers, just numbers — and one number coming out. Inside the box: multiply each input by a weight (its own little dial set to some value), add the products up, add one more number called the bias, then squash the total through a smooth shape that compresses anything large into a small range. The output is one number. That is the entire primitive.
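The box is small enough to write out in full. What follows is a minimal sketch, assuming the logistic sigmoid as the squashing shape; the function names and the example numbers are illustrative choices of mine, not anything the field fixes.

```python
import math


def sigmoid(total: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-total))


def neuron(inputs: list[float], weights: list[float], bias: float) -> float:
    """Multiply each input by its weight, add the bias, squash the total."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(total)


# Illustrative values only: three inputs, three dials, one bias.
output = neuron([0.5, -1.0, 2.0], weights=[0.8, 0.2, -0.5], bias=0.1)
print(round(output, 4))
```

The weighted sum here is 0.4 - 0.2 - 1.0 + 0.1 = -0.7, and the sigmoid squashes that to roughly 0.33. Change any weight and the output moves; that is all a dial is.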

Every artificial neural network you have ever heard of is built out of millions or billions of these little functions, wired into long stacks. The functions are identical. Only the weights differ. Train a network and you are doing exactly one thing: adjusting the weights. The architecture is fixed in advance. The numbers inside the boxes are what learning changes.

Learning is nothing but adjusting the weights.

The next question is how. Picture a single one of these boxes, sitting on its own. You feed it an example: some inputs, an expected output. It produces its own output. Compare the two. Was the box right or wrong? If wrong, nudge each weight a little, in the direction that would have brought the box’s output closer to the expected one. Show it another example. Nudge again. Repeat across a few thousand examples. The weights settle. The box is now doing something that looks like a task. The intro mentioned that McCulloch and Pitts proposed the unit in 1943 and Rosenblatt built it in hardware as the Perceptron in 1957; this is the thing inside the box, and the perceptron learning rule is the procedure I just described, stripped of its formula.¹
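The nudging procedure fits in a dozen lines. The sketch below assumes the original hard-threshold version of the box (output 1 if the sum is positive, else 0) and a toy task I picked for illustration, the logical AND of two inputs; the learning rate and number of passes are likewise assumptions.

```python
def predict(weights: list[float], bias: float, inputs: list[float]) -> int:
    """Hard-threshold box: 1 if the weighted sum is positive, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(examples, learning_rate: float = 0.1, epochs: int = 50):
    """Perceptron rule: after each example, nudge every weight toward the expected output."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, expected in examples:
            error = expected - predict(weights, bias, inputs)  # -1, 0 or +1
            # A wrong answer moves each weight in proportion to its own input.
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Toy task (an assumption for illustration): output 1 only when both inputs are 1.
AND_EXAMPLES = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 0), ([1.0, 1.0], 1)]

weights, bias = train_perceptron(AND_EXAMPLES)
predictions = [predict(weights, bias, inputs) for inputs, _ in AND_EXAMPLES]
print(predictions)
```

When the box is right, the error is zero and nothing moves. When it is wrong, every dial that had a say gets pushed the corrective way. On a task a straight line can solve, the weights stop changing after a handful of passes.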

Reread the last paragraph. The procedure is the most obvious thing in the world. That is by design. The genius is in the realisation that this rule, repeated billions of times across enough boxes, does most of what the modern field calls AI.

But there is a limit to what one box can do. Imagine the inputs as coordinates on a piece of paper, every example a dot. The output of one box, before squashing, is a weighted sum of those coordinates plus a constant. Set that sum to zero and you have the equation of a straight line: dots on one side push the output up, dots on the other push it down. A single box, no matter how cleverly tuned, can only carve its input space into two halves with a single straight cut.

If the world divided itself neatly by straight lines, this would be enough. Sometimes it does. Cats are, on average, heavier and taller than chihuahuas. Draw the right line in the height-weight plane and you separate most cats from most chihuahuas with one box. Many problems in the credit-card industry and the loan industry and the insurance industry look approximately like that. For those problems, a single neuron — what your introductory statistics class would call logistic regression — is genuinely sufficient.

For everything else, a single straight line is not enough. The next section is about what not enough looks like, mechanically.

Why one layer wasn’t enough

Imagine four dots on a sheet of paper, arranged at the corners of a square. The two on one diagonal — top-left and bottom-right — are red. The two on the other diagonal are blue. Stare at them for a moment.

There is no straight line you can draw on that page that puts both red dots on one side and both blue dots on the other. Try. Tilt the line however you like, slide it wherever you like. One of the dots is always on the wrong side. The configuration has a name. It is called exclusive or, or XOR. It has a small place in the history of the field, but the historical part is the intro’s. What matters here is the geometric fact. No single neuron, with any weights you can find for it, can produce the right output on those four examples. The straight line does not exist.
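If you would rather check than take my word for it, the algebra is four lines. Put the dots at (0,0), (0,1), (1,0) and (1,1), with the red pair at (0,1) and (1,0). A single box would need b ≤ 0, w2 + b > 0, w1 + b > 0 and w1 + w2 + b ≤ 0. Add the middle two and you get w1 + w2 + b > -b ≥ 0, which contradicts the last. The sketch below makes the same point by brute force, sweeping a grid of weights; the grid range and resolution are arbitrary assumptions, and a sweep is a sanity check, not a proof.

```python
from itertools import product

# Each entry: ((x, y), label). Label 1 = red, 0 = blue.
XOR_DOTS = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
AND_DOTS = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]


def separates(w1: float, w2: float, b: float, dots) -> bool:
    """True if the line w1*x + w2*y + b = 0 puts every dot on its correct side."""
    return all((w1 * x + w2 * y + b > 0) == bool(label) for (x, y), label in dots)


def count_separating_lines(dots, steps: int = 21, limit: float = 5.0) -> int:
    """Try every (w1, w2, b) on a coarse grid and count the settings that work."""
    grid = [-limit + 2 * limit * i / (steps - 1) for i in range(steps)]
    return sum(separates(w1, w2, b, dots) for w1, w2, b in product(grid, repeat=3))


print("AND:", count_separating_lines(AND_DOTS))
print("XOR:", count_separating_lines(XOR_DOTS))
```

The AND arrangement finds plenty of working lines. The XOR arrangement finds none, however fine you make the grid.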

This is the wall that broke the empiricist branch in 1969. Two researchers proved it as a theorem, the field’s funding agencies read the proof, and for the next seventeen years the only people who kept working on neural networks were, by one later count, fewer than fifty researchers on Earth.² You can hold the entire generation that fought to keep the technique alive in your head as one room of stubborn people.

The fix, viewed mechanically, is the most natural thing imaginable. If a single line cannot split the dots, bend the boundary. And bending the boundary, by composing it out of pieces, is something the algebra of these little boxes can in fact do, as long as you stop using just one of them.

Put two neurons side by side, each with the same inputs. The first one draws a line that separates the top-left dot from the other three. The second draws a line that separates the bottom-right dot from the other three. Now feed both their outputs into a third neuron. That third neuron is no longer looking at the original coordinates. It is looking at the answers of the two boxes that went before it, and from those it can draw its own line through their combined output space. That line, projected back into the original coordinate space, is no longer straight. It bends around exactly the configuration of dots that one neuron could not handle. The wall is gone.
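Here is that three-box arrangement with the dials set by hand. It is a sketch under assumptions: I use hard thresholds instead of a smooth squash so the arithmetic is exact, I place the top-left dot at (0, 1) and the bottom-right dot at (1, 0), and the particular weights are one workable choice among many.

```python
def step(total: float) -> int:
    """Hard threshold: 1 if the sum is positive, else 0."""
    return 1 if total > 0 else 0


def xor_network(x: float, y: float) -> int:
    """Two boxes side by side, feeding a third. Returns 1 for red, 0 for blue."""
    # First box: its line cuts the top-left dot (0, 1) off from the other three.
    top_left = step(-1.0 * x + 1.0 * y - 0.5)
    # Second box: its line cuts the bottom-right dot (1, 0) off from the other three.
    bottom_right = step(1.0 * x - 1.0 * y - 0.5)
    # Third box never sees x or y, only the two answers. It fires if either fired.
    return step(1.0 * top_left + 1.0 * bottom_right - 0.5)


for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x, y), "->", xor_network(x, y))
```

Each of the three boxes still draws a single straight line. The bend appears only because the third box draws its line in the space of the first two boxes' answers.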

Put more such boxes in front, and another row of boxes after them, and the boundary you can draw becomes arbitrarily wiggly. Composition of straight cuts produces curved regions. That is the whole architectural idea, stated without a single equation.

The catch — and it is a real catch, with a price tag of seventeen years — is that nobody knew how to train the stacked version. The perceptron rule only knows how to blame the last layer for being wrong. When the network has three or four or thirty rows of boxes, and the answer at the end is wrong, which of the thousands of dials should you nudge, and in which direction? The middle layers have no expected output anyone wrote down. They produce internal numbers that are not visible to the world, and there is no obvious way to tell them what they should be producing instead. Open question, with the field stuck on it.

The next section is the answer.

Hidden layers and backpropagation

The rows in the middle, the ones whose outputs you do not see, have a name. They are hidden layers. The name is older than the technique that made them useful, and it lands the right idea. Nobody tells those neurons what to detect. They settle on patterns by themselves, in the course of being trained, and the patterns they settle on are whatever turned out to be useful for getting the final answer right.

I want to make that intuition vivid before we get to the procedure. Picture a hidden layer of three neurons, each drawing its own straight line through the input space. Together those three lines, depending on how they fall, can fence off a triangle of one colour from the rest of the plane. None of the three lines on its own knows it is part of a triangle. The output neuron, sitting on top of them, knows nothing about the input coordinates. It sees only three numbers, one from each line-drawer, telling it which side of each line the current point is on. The combination of those three answers, each close to a plain yes or no once the squashing has done its work, is enough to identify inside-the-triangle versus outside. The shape was assembled by composition. Nobody designed it.

Generalise that picture. Stack more neurons in each layer and you get more lines, and so finer-grained regions. Stack more layers and you get to combine those regions into ever more elaborate shapes. The expressive power scales beautifully with size. The only question, again, is how to train the dials when the layers in the middle have nobody telling them what they should be doing.

The 1986 answer, and the modern field’s foundational trick, is called backpropagation.³ Here is the entire idea, in one image.

The network has just produced an answer, and the answer was wrong. Measure how wrong it was, as a number. That number is the loss. The loss is the responsibility of the output layer, certainly. The output layer’s neurons each contributed something to the final answer; each of their weights gets a share of the blame, in proportion to how much that particular weight pushed the answer in the wrong direction. So far this is the same kind of nudging the single-neuron rule did, just over a row of neurons in parallel. Nothing new yet.

The leap is what happens next. Each neuron in the output layer also has an upstream connection to every neuron in the layer just below it. Those are the inputs it summed. We just decided that this output neuron’s own contribution to the loss had a certain shape. By the same logic, the inputs it received also contributed to that shape, in proportion to the weights connecting them. We can therefore pass a smaller share of the blame backwards to each of the neurons one layer down. They, in turn, take their fractional blame, attribute it among their own weights, and pass yet smaller shares of it back to the layer below them. The error trickles upstream from the output to the input, one row at a time, splitting at every connection in proportion to how much each connection pulled the answer the wrong way. By the time the procedure finishes — at the very first layer of the network — every single weight in the system has a number written next to it that says nudge me by about this much, in about this direction. The single update step is to apply all of those nudges at once.

That is the whole mechanism. I call it blame assignment; the literature calls the problem it solves credit assignment, which is the same idea seen from the cheerful side. The chain rule of calculus is what makes the arithmetic work out, and it is mentioned by name in every textbook treatment of backprop, but the mathematical machinery is not the idea. The idea is: propagate the blame backwards. Once you have that picture, everything else about how deep networks are trained is detail.
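For readers who want to see the blame actually move, here is a minimal sketch on the smallest network that has a middle: two inputs, two hidden boxes, one output box. The assumptions are mine: sigmoid squashing, a squared-error loss, and arbitrary starting weights. The last function checks one of the computed nudges against a brute-force estimate (wiggle the weight, watch the loss), which is the standard way to convince yourself the bookkeeping is right.

```python
import copy
import math


def sigmoid(total: float) -> float:
    return 1.0 / (1.0 + math.exp(-total))


def forward(p: dict, x: list[float]):
    """Run the input through the hidden row, then through the output box."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(p["W1"], p["b1"])]
    output = sigmoid(sum(w * h for w, h in zip(p["W2"], hidden)) + p["b2"])
    return hidden, output


def loss(p: dict, x: list[float], target: float) -> float:
    _, output = forward(p, x)
    return 0.5 * (output - target) ** 2


def backprop(p: dict, x: list[float], target: float) -> dict:
    """Return, for every weight, how much the loss rises per unit increase in it."""
    hidden, output = forward(p, x)
    # Blame at the output box: the error, scaled by the slope of its squash.
    blame_out = (output - target) * output * (1.0 - output)
    # Pass a share back to each hidden box, in proportion to the connecting weight.
    blame_hidden = [blame_out * w * h * (1.0 - h) for w, h in zip(p["W2"], hidden)]
    return {
        "W2": [blame_out * h for h in hidden],
        "b2": blame_out,
        "W1": [[bh * xi for xi in x] for bh in blame_hidden],
        "b1": list(blame_hidden),
    }


def apply_step(p: dict, g: dict, learning_rate: float) -> dict:
    """Apply every nudge at once, each against its own gradient."""
    return {
        "W1": [[w - learning_rate * gw for w, gw in zip(row, grow)]
               for row, grow in zip(p["W1"], g["W1"])],
        "b1": [b - learning_rate * gb for b, gb in zip(p["b1"], g["b1"])],
        "W2": [w - learning_rate * gw for w, gw in zip(p["W2"], g["W2"])],
        "b2": p["b2"] - learning_rate * g["b2"],
    }


def numeric_grad_w1(p: dict, x, target, j: int, i: int, eps: float = 1e-6) -> float:
    """Brute-force check: wiggle one hidden weight and measure the change in loss."""
    up, down = copy.deepcopy(p), copy.deepcopy(p)
    up["W1"][j][i] += eps
    down["W1"][j][i] -= eps
    return (loss(up, x, target) - loss(down, x, target)) / (2 * eps)


# Arbitrary illustrative starting point.
params = {"W1": [[0.5, -0.3], [0.8, 0.2]], "b1": [0.1, -0.1], "W2": [1.0, -1.0], "b2": 0.05}
x, target = [1.0, 0.5], 1.0

grads = backprop(params, x, target)
print("backprop:", grads["W1"][0][1], "numeric:", numeric_grad_w1(params, x, target, 0, 1))
print("loss before:", loss(params, x, target))
print("loss after: ", loss(apply_step(params, grads, 0.5), x, target))
```

The two gradient figures should agree to many decimal places, and the loss should be a little lower after the single update. Real frameworks do this bookkeeping automatically for billions of weights, but it is the same bookkeeping.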

A small note before we move on. The squashing function inside each neuron has to be smooth, meaning that a small change to the neuron’s input produces a small, predictable change in its output. If the function jumps suddenly at some threshold and is flat everywhere else, as the original perceptron’s was, its slope is zero almost everywhere and undefined at the jump, so the backwards-passed blame has nowhere sensible to land and the procedure jams. This is the entire reason the field replaced the hard threshold with gentler shapes: the sigmoid in the 1980s, the rectified linear unit later on.⁴ The rectified linear unit has one kink, at zero, but it is continuous and has a usable slope everywhere else, and that turns out to be enough. The mathematics permits blame to flow only through functions whose response to input is continuous and has a slope to follow. The textbook lineage of these design choices is long; the chapter only needs the headline.⁵
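You can measure the difference directly. The sketch below estimates the slope of three squashing shapes at the same input by wiggling the input slightly; the probe point 0.7 and the wiggle size are arbitrary assumptions.

```python
import math


def step(z: float) -> float:
    """The original perceptron's hard threshold."""
    return 1.0 if z > 0 else 0.0


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def relu(z: float) -> float:
    """Rectified linear unit: zero below zero, the input itself above."""
    return max(0.0, z)


def slope(f, z: float, eps: float = 1e-4) -> float:
    """Estimate how much f's output changes per unit change in its input."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)


for name, f in [("step", step), ("sigmoid", sigmoid), ("relu", relu)]:
    print(name, slope(f, 0.7))
```

The hard threshold reports a slope of zero: away from the jump, no small change to the input changes the output at all, so no blame can be passed through it. The other two give the backward pass something to multiply by.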

So the algorithm was ready in 1986. Then it sat on the margins for two decades. Why? Because the algorithm worked, but the compute did not. Training a network deep enough and wide enough to do anything interesting required millions of multiplications, repeated on each of millions of examples, repeated over hundreds of passes through the data. In 1986 that was a year of computing time for a problem the size of recognising handwritten digits. By 2009 it was an afternoon. The next section is about the algorithm that walks down the loss; the sections after it pick up the story from the moment compute caught up and the wall fell.

Gradient descent, plainly

Backpropagation tells you, for every weight in the network, which direction to nudge it and by roughly how much. It does not tell you how big a step to take, when to stop taking steps, or how to make the procedure tractable when you have millions of training examples and cannot afford to look at all of them before each step. Those decisions belong to the algorithm that uses the gradients backprop computes. The algorithm has a name that sounds intimidating and a meaning that is not. It is called gradient descent, and the intuition for it is one image.

Picture the loss as a landscape. Every possible setting of all of the network’s weights is a single point on a multi-dimensional hilly surface. The height at each point is how badly the network performs on the training data with those weights. Training the network is walking downhill on that surface. Backprop is the procedure that, at any point you currently stand on, tells you which way is steepest down. Gradient descent is the procedure of taking a step in that direction, then asking the question again, then stepping again. You repeat until the ground stops sloping.
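The walk downhill is a short loop. In the sketch below the landscape is a made-up bowl over two weights, with its lowest point at (1, -2); the bowl, the starting point and the stopping tolerance are all assumptions for illustration. In a real network the gradient would come from backprop instead of a formula.

```python
import math


def loss(w1: float, w2: float) -> float:
    """A toy landscape: a bowl whose bottom sits at w1 = 1, w2 = -2."""
    return (w1 - 1.0) ** 2 + 2.0 * (w2 + 2.0) ** 2


def gradient(w1: float, w2: float) -> tuple[float, float]:
    """The uphill direction at the current point, one slope per weight."""
    return 2.0 * (w1 - 1.0), 4.0 * (w2 + 2.0)


def gradient_descent(start, learning_rate=0.1, tolerance=1e-6, max_steps=10_000):
    """Step against the slope, ask again, repeat until the ground is flat."""
    w1, w2 = start
    steps_taken = 0
    for steps_taken in range(max_steps):
        g1, g2 = gradient(w1, w2)
        if math.hypot(g1, g2) < tolerance:  # the ground has stopped sloping
            break
        w1 -= learning_rate * g1
        w2 -= learning_rate * g2
    return w1, w2, steps_taken


w1, w2, steps_taken = gradient_descent(start=(5.0, 5.0))
print(w1, w2, steps_taken)
```

From a starting point well up the side of the bowl, the loop reaches the bottom in well under a hundred steps.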

The size of each step has a name. It is called the learning rate, and it is the single most consequential dial in deep learning practice. If you set it too small, every step is microscopic; the descent takes a thousand years, and the network spends those thousand years not yet good at anything. If you set it too large, you bound across the bottom of the valley and up the opposite hillside, then bound back across, ricocheting forever without settling, or climbing a little higher with every bounce until the run diverges. The art of training a network for the first decade of the modern field consisted, to a depressing extent, of learning by feel which learning rate to use, when to lower it, and how to recover from runs that diverged. Modern training algorithms (Adam is the workhorse) maintain a separate step size for every weight and adapt those step sizes as the training proceeds, sparing the practitioner most of the misery.⁶
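All of these behaviours show up on the simplest valley there is. The sketch below descends the one-weight loss w², whose slope at any point is 2w, from the same start with four learning rates; the loss and the specific rates are illustrative assumptions, chosen so the arithmetic is easy to follow.

```python
def descend(learning_rate: float, start: float = 1.0, steps: int = 50) -> float:
    """Plain gradient descent on loss(w) = w**2, whose gradient is 2*w."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
    return w


too_small = descend(0.001)   # each step shrinks w by only 0.2 percent
well_chosen = descend(0.4)   # each step shrinks w to a fifth of its value
ricochet = descend(1.0)      # each step lands at the mirror-image point
diverging = descend(1.1)     # each step overshoots further than the last

print(too_small, well_chosen, ricochet, diverging)
```

After fifty steps the timid run has covered about a tenth of the distance, the well-chosen run is at the bottom, the third run is exactly as far from the bottom as when it started, and the fourth has been flung thousands of units up the hillside.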

There is one more practical move that almost defines modern training. The landscape we described, where height equals the network’s badness on the whole training set, is too expensive to evaluate. Looking at every example before taking a step would mean reading the entire dataset, which today is measured in terabytes, between every single update. The trick is called stochastic gradient descent. Instead of surveying the whole landscape before each step, you glance at a small random batch of examples — typically a few hundred — and estimate the slope from just that handful. The estimate is noisy; your step is in roughly the right direction, but not exactly. You take it anyway. Step, glance, step. The noise turns out to be a feature, not a bug. It knocks the descent loose from shallow ditches that would have trapped a more careful procedure, and it gets the work done thousands of times faster than the alternative. Every modern network is trained this way.⁷
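Step, glance, step looks like this in code. The sketch assumes a deliberately tiny problem, fitting a line to noisy points scattered around y = 2x + 1, so that the answer is known in advance; the dataset, batch size, learning rate and step count are all my assumptions.

```python
import random


def make_data(n: int, rng: random.Random) -> list[tuple[float, float]]:
    """Noisy points scattered around the line y = 2x + 1."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)))
    return data


def sgd_fit(data, rng, batch_size=32, learning_rate=0.1, steps=2000):
    """Fit y = w*x + b, estimating the slope of the loss from a random handful each step."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)  # the glance
        # Gradient of mean squared error, computed on the batch only.
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / batch_size
        grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / batch_size
        w -= learning_rate * grad_w           # the step
        b -= learning_rate * grad_b
    return w, b


rng = random.Random(0)  # fixed seed so the run is repeatable
data = make_data(1000, rng)
w, b = sgd_fit(data, rng)
print(w, b)
```

No single step ever looks at more than 32 of the 1,000 points, and the fitted line still lands close to the one that generated the data.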

A worry that haunts every introduction to gradient descent, and that probably haunts the careful reader right now, is the question of local minima. The landscape, surely, must be full of small valleys that are not the lowest point but are lower than everything immediately around them. Stochastic descent walks downhill blindly. Would it not get stuck at the first such valley it finds?

The honest answer is the one practitioners arrived at by accident, training networks for years before anyone proved the theorems. In a low-dimensional landscape, the kind you can picture in two or three dimensions, local minima are everywhere and they are a problem. In the landscapes deep networks actually walk through, which have millions of dimensions, the geometry is different. A point that is a local minimum in every direction at once is astronomically rare; almost every apparently-stuck point is actually a saddle, where in most directions you are at the bottom of a valley but in at least one direction you could still go down further. Stochastic descent, with its noise, almost always finds the direction sooner or later and keeps moving. You do not get trapped; you get slow. Slow is solvable.⁸
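A two-dimensional toy cannot reproduce the million-dimensional geometry, but it can show the mechanism. The sketch below uses a made-up surface, x² + y⁴/4 - y²/2, which has a saddle at the origin and true valleys at y = ±1; the surface, the noise level and the starting point are assumptions chosen to make the effect visible. A perfectly careful descent started on the ridge line rolls into the saddle and stops. The same descent with a little noise in its slope estimates gets knocked off the ridge and finds a valley.

```python
import random


def loss(x: float, y: float) -> float:
    """Toy surface: saddle at (0, 0), true minima at (0, 1) and (0, -1)."""
    return x * x + y ** 4 / 4.0 - y * y / 2.0


def gradient(x: float, y: float) -> tuple[float, float]:
    return 2.0 * x, y ** 3 - y


def descend(noise_scale: float, steps: int = 500, learning_rate: float = 0.1, seed: int = 0):
    """Gradient descent from (1, 0), with Gaussian noise added to each slope estimate."""
    rng = random.Random(seed)
    x, y = 1.0, 0.0  # start exactly on the ridge line that leads into the saddle
    for _ in range(steps):
        gx, gy = gradient(x, y)
        gx += rng.gauss(0.0, noise_scale)
        gy += rng.gauss(0.0, noise_scale)
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y


careful = descend(noise_scale=0.0)
noisy = descend(noise_scale=0.01)
print("careful:", careful, loss(*careful))
print("noisy:  ", noisy, loss(*noisy))
```

The careful run ends at the saddle with a loss of essentially zero. The noisy run ends near one of the two valleys, where the loss is about -0.25.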

The combination of backprop with stochastic gradient descent is the engine. Every network you have ever heard of, including the one writing chat replies on the device in your pocket, is trained by some refinement of this same loop. The refinements matter: momentum, adaptive learning rates, careful initialisation of the weights before training begins,⁹ dropout to keep the network from memorising the training set,¹⁰ batch normalisation to keep internal numbers in a workable range,¹¹ skip connections that let the gradient signal flow back through very deep stacks without dying out on the way.¹² Each refinement is a tweak on this skeleton, not a replacement for it. The skeleton has not changed since 1986. What has changed is everything around it.

Why depth is the whole point

Now we come to the section the chapter has been building toward. Everything so far — the box, the stack, the blame-propagation, the downhill walk — is mechanism. Mechanism by itself does not explain why deep neural networks took over. It does not explain why a network with twenty hidden layers blows past a network with two hidden layers, even when both have the same total number of weights and both are trained by the same algorithm. The answer is a single thesis sentence, and it is the most consequential idea in modern AI.

A deep network learns a hierarchy of features.

Let me make that vivid. Consider a deep network trained to recognise objects in photographs. Its input layer sees raw pixels. After training, if you crack the network open and look at what each layer has learned to respond to, you find something remarkable. The earliest layers, the ones closest to the pixels, have settled into detectors for short edges and gradients of colour. Tilt the photo, and certain neurons in that first layer fire harder; rotate it ninety degrees and a different set fires. They are an alphabet of edges.

Look one layer further in. Those neurons fire when combinations of edges appear: an arc, a corner, a small repeating texture. Look two layers further still, and the neurons are responding to parts. Something that looks suspiciously like an eye, or a wheel, or a leaf. Two layers beyond that, the responses are for whole objects: a face, a car, a tree. The hierarchy was not designed. The network’s training procedure built it, by composition, out of those tiny little boxes whose dials we were so carefully nudging. Each layer detects features that are built out of the features detected by the layer beneath it.¹³

I want to step back, because this is where the chapter’s spine pays off. Before deep learning, the human did this job. A computer-vision researcher in 2009 spent months — sometimes years — designing edge detectors, texture filters, colour histograms, and shape descriptors, hand-crafted feature pipelines that the rest of their model would then learn to weight. The hand-engineering was not the model. It was the input to the model. The model’s only job was to combine the human-designed features into a final answer. The features themselves were the most prized scientific artefact in the field, and they were a human creation through and through.

A deep network learns the features. The whole stack of edges-to-parts-to-objects is what training produced, from nothing but raw pixels and labels. Nobody designed an edge detector for the network. The network discovered the edge detector because edge detectors turned out to be useful for the only task it was actually being scored on: telling cats from dogs. This is the section where the “what to measure” job, which had been a human job for fifty years, fully moves to the machine. That is the revolution. The architectures in the rest of this chapter are variations on this single idea.

A short footnote in the corner here, because the literature has a famous theorem and the chapter would feel incomplete without naming it. The universal approximation theorem says that a neural network with a single hidden layer can, in principle, approximate any continuous function on a bounded region to arbitrary precision, given enough neurons.¹⁴ The theorem is true, the theorem is celebrated, and the theorem explains nothing about why deep networks work in practice. It tells you what is theoretically possible with one layer; it does not tell you how many neurons that takes, or that you can actually train such a network on a finite amount of data with a finite amount of compute. The width you would need can be astronomical. What deep networks do, by contrast, is achieve excellent approximation with reasonable amounts of compute, by reusing sub-features through composition. The composition is what depth buys you. Theoretical universality is a sideshow.

Why is composition so efficient? Because a feature that takes work to learn — say, an eye-detector — gets to serve as the input to many higher-level detectors at once. A face-detector uses it. So does an emotion classifier, a gaze tracker, a face-recognition system. The same earlier work, once learned, propagates outward to a whole population of downstream uses. A single wide layer, by contrast, has to learn the eye-detector independently for every higher-level task it wants to support. Reuse is what makes depth cheap where width is expensive, for any problem where the world itself has hierarchical structure. And the world, it turns out, mostly does.

Here is where the chapter’s other through-line lands. The modern field has a slogan called the Bitter Lesson. It is the observation, due to Richard Sutton, that across the history of AI research, methods which scale with compute consistently overtake methods that rely on hand-crafted human cleverness.¹⁵ Now you can see the mechanism. Hand-engineered features plateau, because the humans who design them have finite time and finite imagination. Learned features scale, because adding more compute and more data lets the network discover better features automatically. That is the lesson, not as a slogan but as a piece of machinery. Once you have a stack of differentiable layers, a loss function, and the engine that walks downhill, more compute and more data buy you better features. The returns are predictable, though each doubling buys a little less than the one before, and nobody has yet found the ceiling. The empiricist branch the intro narrated finally had its winning ground.

The architectures that mattered

The recipe in the last five sections is generic: some stack of differentiable layers, some loss, some training loop. Once the recipe works, the open question shifts to which shape of stack, meaning which arrangement of neurons and connections, is the right one for the kind of data you have. Three architectures dominated the five years between AlexNet in 2012 and the transformer in 2017. Each one is a clever rearrangement of the same primitive, and each one solves a different shape-of-the-data problem.

Convolutional networks, for images. A photograph is a grid of pixels. Two pixels next to each other are almost certainly related; two pixels at opposite ends of the image are almost certainly not. The plain fully-connected network does not know this. It treats every pixel as exchangeable with every other, and it tries to learn from scratch that pixels close together matter together. That is wasteful. So is the second problem: a cat in the top-left of the image and the same cat in the bottom-right are the same cat, and a fully-connected network has to learn to detect cats independently in every position.

The convolutional network bakes both observations into its wiring. Instead of every neuron in a layer being connected to every input, each neuron looks at only a small patch of the image at a time, its receptive field, and the same little detector is slid across the whole image, applying the same weights at every position. One trained edge-detector serves the entire photograph. The network has many such sliding detectors, each tuned to a different local pattern, and stacked above them are more layers of sliding detectors that combine the previous layer’s responses into bigger patches. The hierarchy of features I described in the last section is what these stacks discover in practice. It is the architecture that won ImageNet in 2012 by roughly ten percentage points of error, and the architecture every serious computer-vision system descends from.¹⁶
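One sliding detector, written out. The sketch assumes a tiny black-and-white image with a dark left side and a bright right side, and a hand-set 2×2 detector for vertical edges; in a trained network the four weights would be learned, not typed in. Strictly, what deep-learning libraries call convolution is this slide-and-sum without flipping the detector, which mathematicians would call cross-correlation.

```python
def slide_detector(image: list[list[float]], kernel: list[list[float]]) -> list[list[float]]:
    """Slide one small detector over every position, reusing the same weights each time."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            # Weighted sum over the patch whose top-left corner is (r, c).
            sum(kernel[i][j] * image[r + i][c + j]
                for i in range(k_rows) for j in range(k_cols))
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


# Illustrative 4x5 image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
# Hand-set detector: responds where brightness rises from left to right.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

for row in slide_detector(image, vertical_edge):
    print(row)
```

The response is zero everywhere except the column where dark meets bright. Four weights covered the whole image, and would cover an image a thousand times larger without a single extra dial.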

Recurrent networks, for sequences. Language is a sequence: one word follows another, each token’s meaning shaped by the tokens that came before it. The convolutional shape is wrong for sequences. The relevant context is not a small spatial window, and order matters: shuffle the words and you change the meaning. The recurrent network handles this by adding time as an explicit dimension. The network reads the sequence one token at a time, and at each step it carries a hidden state: a vector of numbers, summarising everything it has seen so far. Each step, it updates the hidden state from the previous step and the new token, then emits an output. The hidden state is the network’s running memory.

It is a clever idea, and for a few years it was the default tool for everything sequential: translation, speech recognition, text classification, music generation. It has two structural problems, both consequential. The first is that it has to be processed in order. You cannot start step 17 before you finish step 16, because step 17 depends on the hidden state that step 16 produces. This is fine when sequences are short, but it bottlenecks training when sequences are long, and it does not parallelise across the kind of hardware that made deep learning take off in the first place — the GPU, built for doing the same operation on thousands of things at once. The second problem is subtler. The hidden state has fixed size; the network has to keep cramming new information into it, and old information gradually washes out. By the time the network has read a long paragraph, the gradient signal flowing back to the first sentence — the signal that would teach the network how the beginning influenced the end — has shrunk to near zero. The network forgets. Variants like the long short-term memory cell push the forgetting back a long way, but they do not eliminate it.¹⁷
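Both problems are visible in a recurrent network shrunk to a single number. The sketch below assumes a one-dimensional hidden state, a tanh squash, and fixed illustrative weights; nothing is trained. The loop cannot be parallelised, because each pass needs the hidden state from the one before. And the running product tracks how strongly the final state still depends on the starting state, which is the signal backprop would have to carry back to the beginning.

```python
import math


def run_rnn(inputs: list[float], w: float = 0.5, u: float = 1.0, h0: float = 0.0):
    """Scalar recurrent network: h_t = tanh(w * h_{t-1} + u * x_t).

    Returns the final hidden state and d(h_final)/d(h0), the sensitivity of
    the end of the sequence to the state the network started with.
    """
    h = h0
    sensitivity = 1.0
    for x in inputs:                      # strictly in order: step t needs step t-1
        h = math.tanh(w * h + u * x)
        # Chain rule through one step: slope of tanh times the recurrent weight.
        sensitivity *= w * (1.0 - h * h)
    return h, sensitivity


short_inputs = [math.sin(t) for t in range(5)]
long_inputs = [math.sin(t) for t in range(50)]

_, short_sensitivity = run_rnn(short_inputs)
_, long_sensitivity = run_rnn(long_inputs)
print("after 5 steps: ", short_sensitivity)
print("after 50 steps:", long_sensitivity)
```

Each step multiplies the sensitivity by a factor no larger than the recurrent weight, here one half. After five steps a few percent of the signal survives. After fifty, what survives is smaller than one part in a trillion.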

I will frame recurrent networks honestly. They are a stepping stone. They worked, they were the state of the art for a while, and they had structural ceilings that the next architecture demolished.

Sequence-to-sequence and the attention patch. In 2014, a small change to the recurrent recipe turned machine translation into a tractable deep-learning problem.¹⁸ The idea was to chain two recurrent networks. The first one, the encoder, reads an entire input sentence and produces a single vector that summarises it. The second one, the decoder, takes that vector and unrolls it back into the output sentence, one token at a time. Encode the English sentence, decode it back as French. The same architecture handled summarisation, dialogue, code generation, and anything else where the input was a sequence and the output was a different sequence.

The single-vector bottleneck was the obvious problem. One vector to summarise a paragraph is a lot to ask. So later in 2014, three researchers in Montreal published a patch.¹⁹ Instead of forcing the decoder to rely on a single summary vector, they let it look back at every position in the encoder’s output as it generated each token, with a learned weighting that decided how much attention to pay to each input position. The patch was called attention, and it was the missing piece for getting long-distance dependencies right.
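The core of the patch is a weighted average. The sketch below is a simplification under a stated assumption: it scores each encoder position by a plain dot product with the decoder's current state, whereas the 2014 paper computed the score with a small learned network. The vectors are made up for illustration. Either way, the scores pass through a softmax to become weights that sum to one, and the decoder receives the weighted blend.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: list[float], encoder_states: list[list[float]]):
    """Weight every encoder position by its relevance to the query, then blend."""
    # Relevance score: dot product between the query and each encoder state.
    scores = [sum(q * k for q, k in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    width = len(encoder_states[0])
    context = [sum(w * state[d] for w, state in zip(weights, encoder_states))
               for d in range(width)]
    return weights, context


# Made-up encoder outputs for a three-token input, and a made-up decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
query = [0.0, 2.0]

weights, context = attend(query, encoder_states)
print("weights:", [round(w, 3) for w in weights])
print("context:", [round(c, 3) for c in context])
```

The second position lines up with the query and takes about 79 percent of the attention; the other two split the rest. The decoder no longer depends on one summary vector: at every output token it gets a fresh blend, weighted toward whichever input positions matter now.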

Notice the shape of the situation. The recurrent spine is still there. Each step still depends on the previous one, training is still sequential, the forgetting is still partially papered over rather than solved. Attention is bolted on as an extension. It is not a foundation. A team of researchers in 2016, looking honestly at the architecture, could see the next move from a mile away.

The problem still open in 2016

By the close of 2016, the empiricist branch the intro described had every component it needed for a working brain. Backpropagation. Gradient descent on a small batch at a time. GPUs cheap enough to be measured by the rack. Datasets the size of the public internet. A stack of differentiable layers that learned its own features. The recipe worked. The recipe was scaling. Sutton’s bitter lesson had a working substrate underneath it.

There was, however, a specific architectural debt to clear. Pull together the three drags identified in the last section. One, recurrent networks are slow to train because they process sequences in order, leaving most of the GPU idle most of the time. Two, recurrent networks forget the beginning of long inputs, no matter how cleverly their hidden state is engineered. Three, attention — the patch that fixed the worst of the forgetting — is a graft onto the recurrent spine, not its replacement. The graft is not a clean piece of engineering. Everyone working on these models in 2016 knew that.

So the obvious question hung in the air. Could you build a sequence model with no recurrence at all — pure attention, applied directly to the input, with no hidden state passed along a temporal axis — that trained in parallel across the entire sequence at once, and never forgot anything inside its window? On paper it should be possible. In practice nobody had shipped one.

That is the gap. Six months later, eight researchers at Google would publish a paper that closed it.

The brain without the architecture

You can now draw the machine. You can say where the dials live and how learning adjusts them. You can explain blame propagation to a friend without invoking calculus, and you can describe the walk downhill without invoking optimisation theory. You can say why depth matters — not “deeper networks are better” as a slogan, but the specific mechanism by which composition of features automates the job a human used to do by hand. You can name the three structural problems that recurrence inherits and see, in advance, the shape of the architecture that solved them.

The empiricist branch had been waiting for its substrate for half a century. In 2012 it found one. In 2016 the substrate had everything but one architectural piece, and you can now see exactly the shape of the missing piece.

The next chapter is the shape of the piece.

¹ Schmidhuber, J. Deep Learning in Neural Networks: An Overview. Neural Networks 61 (2015): 85–117. arXiv:1404.7828, 2014. Comprehensive genealogy of backpropagation, CNNs, and LSTMs in the author’s own words; substitutes for the inaccessible scanned originals.
² Schmidhuber, J. Deep Learning in Neural Networks: An Overview. Neural Networks 61 (2015): 85–117. arXiv:1404.7828, 2014. Comprehensive genealogy of backpropagation, CNNs, and LSTMs in the author’s own words; substitutes for the inaccessible scanned originals.
³ Rumelhart, D. E., Hinton, G. E., & Williams, R. J. Learning representations by back-propagating errors. Nature 323, 533–536, 1986. The algorithm was discovered independently several times in earlier decades; see Schmidhuber 2014 for the full lineage.
⁴ He, K., Zhang, X., Ren, S., & Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv:1502.01852, 2015. The paper that paired the rectified linear unit with the initialisation scheme tailored to it.
⁵ Goodfellow, I., Bengio, Y., & Courville, A. Deep Learning. MIT Press, 2016. https://www.deeplearningbook.org/ The canonical textbook treatment of the topics in this chapter.
⁶ Kingma, D. P. & Ba, J. Adam: A Method for Stochastic Optimization. arXiv:1412.6980, 2014.
⁷ LeCun, Y., Bengio, Y., & Hinton, G. Deep Learning. Nature 521, 436–444, 2015. The three architects’ joint statement of the paradigm.
⁸ Goodfellow, I., Bengio, Y., & Courville, A. Deep Learning. MIT Press, 2016. https://www.deeplearningbook.org/ The canonical textbook treatment of the topics in this chapter.
⁹ Glorot, X. & Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. AISTATS 2010. The “Xavier” initialisation.
¹⁰ Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR 15:1929–1958, 2014.
¹¹ Ioffe, S. & Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167, 2015.
¹² He, K., Zhang, X., Ren, S., & Sun, J. Deep Residual Learning for Image Recognition. arXiv:1512.03385, 2015. Introduces the skip connections that made networks with hundreds of layers trainable.
¹³ LeCun, Y., Bengio, Y., & Hinton, G. Deep Learning. Nature 521, 436–444, 2015. The three architects’ joint statement of the paradigm.
¹⁴ Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2, 303–314, 1989. Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural Networks 4 (1991): 251–257.
¹⁵ Sutton, R. The Bitter Lesson. 2019. http://www.incompleteideas.net/IncIdeas/BitterLesson.html
¹⁶ Krizhevsky, A., Sutskever, I., & Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. NeurIPS 2012. The ten-point ImageNet result that pivoted the field overnight.
¹⁷ Schmidhuber, J. Deep Learning in Neural Networks: An Overview. Neural Networks 61 (2015): 85–117. arXiv:1404.7828, 2014. Comprehensive genealogy of backpropagation, CNNs, and LSTMs in the author’s own words; substitutes for the inaccessible scanned originals.
¹⁸ Sutskever, I., Vinyals, O., & Le, Q. V. Sequence to Sequence Learning with Neural Networks. NeurIPS 2014.
¹⁹ Bahdanau, D., Cho, K., & Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473, 2014. The first appearance of attention as a learned alignment, grafted onto a recurrent encoder–decoder.