Why deep learning is mysterious
Deep neural networks trained with gradient descent generalise well. This is surprising.
Deep learning works: neural networks can successfully learn to form grammatically correct sentences, generate coherent images, and transcribe audio. This is done by following roughly these steps (a minimal code sketch is given after the list):
Obtain a dataset containing examples of the task being performed correctly (e.g. text sequences for the task of generating text, image-label pairs for the task of classifying images);
Pick a suitable neural network architecture;
Train until the examples are handled correctly.
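As a concrete (and heavily simplified) illustration of this recipe, here is a minimal sketch assuming a PyTorch-style setup; the dataset, architecture, and hyperparameters are placeholders rather than anything from a real task.

```python
# Minimal sketch of the recipe above (assumes PyTorch is installed).
# The "dataset" is synthetic: random input vectors, with a label given by a
# simple rule on the first feature, standing in for a real task.
import torch
import torch.nn as nn

# Step 1: a toy dataset of input-label pairs.
inputs = torch.randn(256, 20)             # 256 examples, 20 features each
labels = (inputs[:, 0] > 0).long()        # placeholder rule: sign of feature 0

# Step 2: pick a neural network architecture.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

# Step 3: train with gradient descent until the examples are handled correctly.
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for step in range(1000):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    optimizer.step()

train_accuracy = (model(inputs).argmax(dim=1) == labels).float().mean().item()
print(f"training accuracy: {train_accuracy:.2f}")  # should approach 1.0
```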
This is surprising! In particular, it is surprising that the trained neural network behaves correctly on inputs that are not part of the training dataset. To illustrate why, consider an analogy.
An analogy
We have some data consisting of two groups of shapes:
Now suppose I gave you a blue square. Does it belong to group A or group B?
Well, there are at least two hypotheses that are consistent with the data shown here. Perhaps shapes belong to A if and only if they are blue, in which case a blue square would belong to A. Or maybe shapes belong to A if and only if they are blue and not square. Then the blue square would have to belong to B. There’s nothing in the data to distinguish which of these rules is the true one for describing A and B, because we’ve never seen a square!
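To make the ambiguity concrete, here is a small sketch in Python. The shape data below is purely illustrative (the exact contents of the figure aren't reproduced here); the only property that matters is that no square appears in it.

```python
# Illustrative only: placeholder training data standing in for the figure.
# Each shape is a (colour, kind) pair; crucially, no squares appear.
training_data = [
    (("blue", "circle"), "A"),
    (("blue", "triangle"), "A"),
    (("red", "circle"), "B"),
    (("red", "triangle"), "B"),
]

# Hypothesis 1: a shape belongs to A iff it is blue.
def h1(shape):
    colour, kind = shape
    return "A" if colour == "blue" else "B"

# Hypothesis 2: a shape belongs to A iff it is blue and not a square.
def h2(shape):
    colour, kind = shape
    return "A" if colour == "blue" and kind != "square" else "B"

# Both hypotheses fit the training data perfectly...
assert all(h1(s) == label and h2(s) == label for s, label in training_data)

# ...but they disagree on the unseen blue square.
print(h1(("blue", "square")), h2(("blue", "square")))  # -> A B
```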
One way to resolve this could be to say that, in the real world, simple hypotheses are more likely (this is the idea of Occam’s razor). Then we have to put the blue square into A, because hypotheses that include provisos like “A contains everything blue except squares” are too complex to be considered likely.
How this relates to machine learning
A machine learning algorithm is in the position that we were in with the shapes. It “sees” a dataset of example inputs, and wants to learn what the true rule is for every possible input. For example, it might see images of handwritten digits along with labels saying what digit (0-9) is shown in each image, and then learn how to classify all possible handwritten digits.
The same problem arises: there are a huge number of possible hypotheses that correctly handle the training examples but act essentially randomly on the rest of the possible inputs (a "hypothesis" here means a function from inputs to outputs). A hypothesis that performs well on all possible inputs is said to generalise well. The learning algorithm has no a priori way to distinguish hypotheses that generalise well from ones that (for example) just memorise the training dataset and spit out the correct answers for those training examples, without learning any general principle about how to perform the task.
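As an illustration (again using made-up shape data), here is a hypothesis that simply memorises the training set, alongside one that expresses a general rule; judged only on the training examples, the two are indistinguishable.

```python
import random

# Placeholder shape data, as before.
training_data = [
    (("blue", "circle"), "A"),
    (("blue", "triangle"), "A"),
    (("red", "circle"), "B"),
    (("red", "triangle"), "B"),
]
memorised = dict(training_data)

def memoriser(shape):
    # Perfect on the training set, arbitrary everywhere else.
    return memorised.get(shape, random.choice(["A", "B"]))

def blue_rule(shape):
    # A hypothesis expressing a general principle.
    return "A" if shape[0] == "blue" else "B"

# Both handle every training example correctly, so the training data alone
# gives the learning algorithm no reason to prefer one over the other.
assert all(memoriser(s) == label for s, label in training_data)
assert all(blue_rule(s) == label for s, label in training_data)
```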
One way to overcome this problem is to restrict our attention to sufficiently simple hypotheses. Linear regression does this by only being able to represent linear functions, which are extremely simple (far too simple for most tasks we care about!). Alternatively, we could use a setup that can technically represent even complex hypotheses, but whose training process is biased towards simpler explanations of the training examples.
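A sketch of that second approach, assuming a PyTorch-style setup: the model can still represent complex functions, but the training objective adds an explicit penalty on "complexity", here crudely measured by the squared size of the weights (L2 weight decay). The particular model and penalty strength are placeholders.

```python
import torch
import torch.nn as nn

# The model can represent many complex functions...
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
loss_fn = nn.CrossEntropyLoss()

# ...but the objective explicitly prefers "simpler" settings of the weights.
def regularised_loss(outputs, labels, weight_decay=1e-2):
    data_fit = loss_fn(outputs, labels)
    complexity = sum((p ** 2).sum() for p in model.parameters())
    return data_fit + weight_decay * complexity
```

In practice the same effect is usually obtained by passing a weight_decay argument to the optimiser (e.g. torch.optim.SGD supports one).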
But deep learning works well even with neither of these ingredients explicitly present! Deep learning uses models with huge numbers of parameters that are capable of representing many complex functions [1]. And even when no explicit bias towards simple solutions is used, deep learning can work well (although such biases are sometimes used, e.g. weight decay).
So it is conjectured that when training deep neural networks with gradient descent, there is an implicit bias towards solutions that are likely to generalise well. But what could this bias be? How does deep learning know how to generalise to inputs outside the training dataset?
These questions don’t have a complete answer yet, but there are several recent papers that make progress towards explaining this mystery. I plan to write about some of the main advances made so far.
[1] See e.g. https://arxiv.org/pdf/1611.03530